jpuritz / dDocent

a bash pipeline for RAD sequencing
ddocent.com
MIT License

Simple Suggestions To Increase Speed of Freebayes #58

Open cbird808 opened 4 years ago

cbird808 commented 4 years ago

suggestion

apply a minimum coverage filter to the cov.stats prior to making the mapped.*.bed files, which will ultimately reduce the number of contigs genotyped

the minimum coverage value

either the number of individuals or 2x the number of individuals. There's really not much point in evaluating a contig if there's not at least 1 read per allele per locus per individual.

speedup

2x in my current situation, plus the time saved later in filtering

implementation

# calculate minCOV = 2 x NumInd - 1
minCOV=$(( $(wc -l < namelist) * 2 - 1 ))
# list the contigs with coverage below the minimum
mawk -v minCOV=$minCOV '$4 < minCOV {print $1}' cov.stats | uniq > low.cov.contigs
# keep only the contigs with high coverage
# (-w prevents a contig name from matching other contigs it is a prefix of)
grep -vwF -f low.cov.contigs cov.stats > high.cov.stats
# replace cov.stats with the filtered version
mv high.cov.stats cov.stats
# clean up the intermediate file
rm low.cov.contigs
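Before running the mv, it's worth checking how much of cov.stats survives the filter (a suggested sanity check, not part of the recipe above):

# compare interval counts before and after filtering
wc -l cov.stats high.cov.stats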
chollenbeck commented 4 years ago

This does help a lot. I've also been removing intervals that are clearly too large or small to be RAD fragments, which further reduces the variant calling overhead.
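Roughly like this (a sketch only; the 100 and 1000 bp bounds are placeholders to tune to your expected fragment sizes):

# drop intervals whose length is implausible for a RAD fragment
mawk '{len = $3 - $2} len >= 100 && len <= 1000' cov.stats > size.filtered.stats
mv size.filtered.stats cov.stats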

cbird808 commented 4 years ago

Hey Chris,

Can you post your code for that?

I'm also experimenting with shuffling cov.stats rather than sorting it. Will let you all know how that pans out.

I'm hesitant to apply the contig-splitting measure because of all the new error-handling code for freebayes. I'm not sure whether the error handling is related to the splitting, but I don't get any errors with the older code for making the mapped.*.bed files, with no splitting and my own parallelization scheme.

The other thing I'm doing is dividing the individuals among nodes and genotyping them separately, with the intention of combining the VCF files after some mild filtering to cut down file size. I'm jumping off the cliff without checking for a smooth landing, but I'm in a rush and there are open nodes...
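Roughly, that scheme might look like this (a sketch only: the node count, file names, and the -RG.bam suffix are assumptions based on dDocent's usual naming, and merging independently called subsets can produce inconsistent allele representations):

# split the individuals into 4 per-node lists: node.00 ... node.03
split -n l/4 -d namelist node.
# on each node, build that node's BAM list and genotype only those individuals
sed 's/$/-RG.bam/' node.00 > node.00.bams
freebayes -f reference.fasta -L node.00.bams -t mapped.1.bed > node.00.vcf
# afterwards: compress, index, and merge the per-node call sets
# (bcftools merge combines VCFs that share sites but differ in samples)
for v in node.*.vcf; do bgzip -f "$v" && tabix -p vcf "$v.gz"; done
bcftools merge node.*.vcf.gz > combined.vcf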

cbird808 commented 4 years ago

It might be too early to tell, but I think this is evidence that shuffling cov.stats is effective at evenly distributing the load across all of the threads:

-rw-r--r-- 1 cbird hobi  17M Nov 14 04:13 raw.12.5.5.vcf
-rw-r--r-- 1 cbird hobi  18M Nov 14 04:13 raw.10.5.5.vcf
-rw-r--r-- 1 cbird hobi  15M Nov 14 04:13 raw.19.5.5.vcf
-rw-r--r-- 1 cbird hobi  18M Nov 14 04:13 raw.8.5.5.vcf
-rw-r--r-- 1 cbird hobi  19M Nov 14 04:13 raw.2.5.5.vcf
-rw-r--r-- 1 cbird hobi  17M Nov 14 04:14 raw.18.5.5.vcf
-rw-r--r-- 1 cbird hobi  14M Nov 14 04:14 raw.11.5.5.vcf
-rw-r--r-- 1 cbird hobi  14M Nov 14 04:14 raw.14.5.5.vcf
-rw-r--r-- 1 cbird hobi  19M Nov 14 04:14 raw.4.5.5.vcf
-rw-r--r-- 1 cbird hobi  19M Nov 14 04:14 raw.3.5.5.vcf
-rw-r--r-- 1 cbird hobi  18M Nov 14 04:14 raw.16.5.5.vcf
-rw-r--r-- 1 cbird hobi  16M Nov 14 04:14 raw.9.5.5.vcf
-rw-r--r-- 1 cbird hobi  16M Nov 14 04:14 raw.15.5.5.vcf
-rw-r--r-- 1 cbird hobi  18M Nov 14 04:14 raw.20.5.5.vcf
-rw-r--r-- 1 cbird hobi  16M Nov 14 04:14 raw.5.5.5.vcf
-rw-r--r-- 1 cbird hobi  17M Nov 14 04:14 raw.6.5.5.vcf
-rw-r--r-- 1 cbird hobi  18M Nov 14 04:14 raw.7.5.5.vcf
-rw-r--r-- 1 cbird hobi  17M Nov 14 04:14 raw.17.5.5.vcf
-rw-r--r-- 1 cbird hobi  17M Nov 14 04:14 raw.1.5.5.vcf
-rw-r--r-- 1 cbird hobi  20M Nov 14 04:14 raw.13.5.5.vcf
-rw-r--r-- 1 cbird hobi  24M Nov 14 04:19 raw.20.5.5.vcf
-rw-r--r-- 1 cbird hobi  22M Nov 14 04:19 raw.2.5.5.vcf
-rw-r--r-- 1 cbird hobi  23M Nov 14 04:19 raw.19.5.5.vcf
-rw-r--r-- 1 cbird hobi  24M Nov 14 04:19 raw.16.5.5.vcf
-rw-r--r-- 1 cbird hobi  19M Nov 14 04:19 raw.3.5.5.vcf
-rw-r--r-- 1 cbird hobi  25M Nov 14 04:19 raw.8.5.5.vcf
-rw-r--r-- 1 cbird hobi  26M Nov 14 04:19 raw.17.5.5.vcf
-rw-r--r-- 1 cbird hobi  22M Nov 14 04:20 raw.7.5.5.vcf
-rw-r--r-- 1 cbird hobi  18M Nov 14 04:20 raw.11.5.5.vcf
-rw-r--r-- 1 cbird hobi  23M Nov 14 04:20 raw.13.5.5.vcf
-rw-r--r-- 1 cbird hobi  20M Nov 14 04:20 raw.10.5.5.vcf
-rw-r--r-- 1 cbird hobi  22M Nov 14 04:20 raw.5.5.5.vcf
-rw-r--r-- 1 cbird hobi  18M Nov 14 04:20 raw.15.5.5.vcf
-rw-r--r-- 1 cbird hobi  24M Nov 14 04:20 raw.4.5.5.vcf
-rw-r--r-- 1 cbird hobi  20M Nov 14 04:20 raw.14.5.5.vcf
-rw-r--r-- 1 cbird hobi  25M Nov 14 04:20 raw.6.5.5.vcf
-rw-r--r-- 1 cbird hobi  26M Nov 14 04:20 raw.12.5.5.vcf
-rw-r--r-- 1 cbird hobi  20M Nov 14 04:20 raw.18.5.5.vcf
-rw-r--r-- 1 cbird hobi  25M Nov 14 04:20 raw.9.5.5.vcf
-rw-r--r-- 1 cbird hobi  22M Nov 14 04:20 raw.1.5.5.vcf

To apply this change, replace

 | sort -V -k1,1 -k2,2 |

with

 | shuf |

I'm pretty sure that code is not repeated elsewhere.
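Concretely, that's a one-line edit to the dDocent script, e.g. (the exact spacing in the script may differ, so confirm the match first):

# find the sort in the bed-file construction, then swap it for shuf
grep -n 'sort -V -k1,1 -k2,2' dDocent
sed -i 's/| sort -V -k1,1 -k2,2 |/| shuf |/' dDocent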

After the raw.*.vcf files are created, they need to be sorted before combining:

if [ ! -d "raw.vcf" ]; then
    mkdir raw.vcf
fi

# sort each VCF: keep everything that is not a dDocent contig record (the header),
# then append the contig records sorted by contig and position
ls raw.*.vcf | parallel --no-notice -j $NUMProc "cat <(grep -v '^dDocent' {}) <(grep '^dDocent' {} | sort -V -k1,1 -k2,2) > ./raw.vcf/{} "
rm raw.*.vcf

echo ""; echo " "`date` "Assembling final VCF file..."
vcfcombine ./raw.vcf/raw.*.vcf | sed -e 's/ \.\:/   \.\/\.\:/g' > TotalRawSNPs.vcf
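Note that the '^dDocent' pattern assumes de novo contig names; a reference-agnostic variant could split on the VCF header instead (a sketch):

ls raw.*.vcf | parallel --no-notice -j $NUMProc "cat <(grep '^#' {}) <(grep -v '^#' {} | sort -V -k1,1 -k2,2) > ./raw.vcf/{} "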

To reiterate, this should solve the following type of behavior, where the first contigs finish much sooner than the others and the freed threads then sit idle, wasting processing power:

-rw-r--r-- 1 cbird hobi  32M Nov 14 04:02 raw.1.5.5.vcf
-rw-r--r-- 1 cbird hobi  46M Nov 14 04:07 raw.2.5.5.vcf
-rw-r--r-- 1 cbird hobi  63M Nov 14 04:14 raw.4.5.5.vcf
-rw-r--r-- 1 cbird hobi  52M Nov 14 04:15 raw.3.5.5.vcf
-rw-r--r-- 1 cbird hobi  72M Nov 14 04:19 raw.5.5.5.vcf
-rw-r--r-- 1 cbird hobi  87M Nov 14 04:25 raw.6.5.5.vcf
-rw-r--r-- 1 cbird hobi 135M Nov 14 04:26 raw.11.5.5.vcf
-rw-r--r-- 1 cbird hobi 202M Nov 14 04:28 raw.16.5.5.vcf
-rw-r--r-- 1 cbird hobi 105M Nov 14 04:28 raw.7.5.5.vcf
-rw-r--r-- 1 cbird hobi 178M Nov 14 04:28 raw.15.5.5.vcf
-rw-r--r-- 1 cbird hobi 113M Nov 14 04:28 raw.10.5.5.vcf
-rw-r--r-- 1 cbird hobi 146M Nov 14 04:28 raw.13.5.5.vcf
-rw-r--r-- 1 cbird hobi 248M Nov 14 04:28 raw.18.5.5.vcf
-rw-r--r-- 1 cbird hobi 165M Nov 14 04:28 raw.14.5.5.vcf
-rw-r--r-- 1 cbird hobi 313M Nov 14 04:28 raw.20.5.5.vcf
-rw-r--r-- 1 cbird hobi 143M Nov 14 04:28 raw.12.5.5.vcf
-rw-r--r-- 1 cbird hobi 120M Nov 14 04:28 raw.9.5.5.vcf
-rw-r--r-- 1 cbird hobi 227M Nov 14 04:28 raw.17.5.5.vcf
-rw-r--r-- 1 cbird hobi 123M Nov 14 04:28 raw.8.5.5.vcf
-rw-r--r-- 1 cbird hobi 294M Nov 14 04:28 raw.19.5.5.vcf
jpuritz commented 4 years ago

Please feel free to submit this as a pull request. This seems reasonable.

jpuritz commented 4 years ago

I'm not sure about shuffling the cov.stats file. Yes, it would distribute the load more evenly, but it achieves that by splitting up high-coverage contigs and by breaking up the first 100 or so contigs, which tend to be the most abundant loci. Line 462 already works to create calling intervals that are balanced across coverage. The TT variable on line 460 could be made smaller to generate smaller calling intervals, but that also comes at a cost, because each initialization of FreeBayes is expensive.
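For context, that balancing works roughly like this (an illustrative re-implementation, not the script's exact code):

# split cov.stats into NUMProc bed files with roughly equal total coverage;
# any remainder can spill into one extra file
DP=$(mawk '{sum = sum + $4} END {print sum}' cov.stats)
mawk -v cutoff=$(($DP / $NUMProc)) 'BEGIN {n = 1}
  {cov += $4; print $1 "\t" $2 "\t" $3 > ("mapped." n ".bed")
   if (cov > cutoff) {cov = 0; n++}}' cov.stats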

Also, the sorting solution presented only works for de novo references.

pdimens commented 3 years ago

I tested this on my real-world linkage map data, and the method suggested by @cbird808 removes all of my contigs. (I mapped to a reference, so every interval on a reference contig shares the same name, and a single low-coverage interval is enough to drop the whole contig.)

$ minCOV=$(echo $(($(wc -l namelist | cut -d" " -f1) * 2 - 1)))
$ mawk -v minCOV=$minCOV '$4 < minCOV {print $1}' cov.stats | uniq > low.cov.contigs
$ wc -l low.cov.contigs
5135 low.cov.contigs
$ grep -f low.cov.contigs -vF cov.stats > low.cov.stats
$ wc -l low.cov.stats
0 low.cov.stats

This is the cov.stats file:

$ head -50 cov.stats
Talbacares_1    1495    1771    37
Talbacares_1    1795    2058    32
Talbacares_1    2864    3285    11
Talbacares_1    3785    3924    4
Talbacares_1    3957    4118    36
Talbacares_1    4129    4194    6
Talbacares_1    4205    4484    26
Talbacares_1    5412    5740    9
Talbacares_1    5764    6082    21
Talbacares_1    6409    6633    2
Talbacares_1    6998    7284    12
Talbacares_1    7307    7477    17
Talbacares_1    7485    7806    62
Talbacares_1    7819    8055    24
Talbacares_1    8066    8362    21
Talbacares_1    8531    8781    8
Talbacares_1    8831    9168    42
Talbacares_1    9525    9846    27
Talbacares_1    10104   10239   6
Talbacares_1    10263   10491   10
Talbacares_1    11277   11616   36
Talbacares_1    11640   12142   48
Talbacares_1    12154   12477   50
Talbacares_1    12536   12787   30
Talbacares_1    12798   13044   14
Talbacares_1    16584   16831   27
Talbacares_1    16842   17203   50
Talbacares_1    17227   17446   10
Talbacares_1    17616   17977   49
Talbacares_1    17984   18284   25
Talbacares_1    20320   20600   114
Talbacares_1    20611   20838   18
Talbacares_1    20959   21523   109
Talbacares_1    21547   21837   18
Talbacares_1    21881   22119   51
Talbacares_1    22125   22355   6
Talbacares_1    22837   23289   50
Talbacares_1    26759   26999   9
Talbacares_1    27023   27368   10
Talbacares_1    27392   27492   2
Talbacares_1    28138   28620   66
Talbacares_1    28636   28976   11
Talbacares_1    29983   30319   42
Talbacares_1    30330   30560   24
Talbacares_1    31494   31768   46
Talbacares_1    31791   32129   95
Talbacares_1    32153   32364   9
Talbacares_1    32391   33084   457
Talbacares_1    33190   33506   32
Talbacares_1    33530   33625   4
jpuritz commented 3 years ago

It seems odd that your coverage is so low.

pdimens commented 3 years ago

Story of my life. My data is janky because we had to whole-genome amplify all the samples prior to ddRAD, and there's a noticeable amount of PCR bias and missing data.

pdimens commented 3 years ago

To follow up on this, the filtering should be opt-in, or at least easy to opt out of, to avoid situations like the one above.

jpuritz commented 3 years ago

Perhaps we can put in something minimal like coverage > 1, but yes, anything more aggressive should be optional.
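i.e., something like this as the default (a sketch):

# drop only intervals with a single read by default; anything stronger is opt-in
mawk '$4 > 1' cov.stats > cov.stats.filtered && mv cov.stats.filtered cov.stats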


pdimens commented 3 years ago

That seems like a safe compromise.