thegenemyers / MERQURY.FK

FastK based version of Merqury

FastK with more threads produces lower Phred with Merqury.FK #9

Open jelber2 opened 2 years ago

jelber2 commented 2 years ago

Hi,

I am not sure about this issue as I am using a non-standard installation of FastK (https://github.com/davebx/FASTK/commit/305d01b81204f6870c034b9abd9d8c280d4d4b76), but maybe it applies to the current production FastK #4604bfc?

Ultimately, the quality value (QV) estimate from MerquryFK is much lower when FastK is run with more threads (I have not exhaustively tried different numbers of threads). Note that the reads and reference (below) were made with bam-anonymize from rust-bio-tools (https://github.com/rust-bio/rust-bio-tools) from real E. coli PacBio HiFi reads aligned to an E. coli reference [one that seemed rather divergent from the strain being sequenced].

get reads (~11 MB gzipped) and reference (~1.5MB gzipped)

wget https://www.dropbox.com/s/9et7bq9k4nc9cf7/anonymous-reference.fasta.gz
wget https://www.dropbox.com/s/j4r7cwf6dtdi3nr/anonymous-reads2.fasta.gz

default threads

FastK -t1 -p -Nanonymous2 anonymous-reads2
MerquryFK -f -pdf -T34 -P./ anonymous2 anonymous-reference anonymous2
cat anonymous2.qv 

Assembly             No Support  Total    Error %  QV
anonymous-reference  901         4641612  0.0005   53.1

More than default number of threads

FastK -T34 -t1 -p -Nanonymous3 anonymous-reads2
MerquryFK -f -pdf -T34 -P./ anonymous3 anonymous-reference anonymous3
cat anonymous3.qv 

Assembly             No Support  Total    Error %  QV
anonymous-reference  137658      4641612  0.0752   31.2

Any help would be greatly appreciated. For now, I would only use the default number of threads/cores.

thegenemyers commented 2 years ago

I would suggest just running FastK with different numbers of threads and seeing if the k-mer histograms are different. That should be sufficient to at least understand if the problem is with FastK (or your version thereof) or something downstream of the k-mer counting. Best, Gene


jelber2 commented 2 years ago

OK, so I have tested whether the histograms differ across -T1 ... -T34.

Generate histograms for -T1...-T34

# Rebuild the k-mer table with i threads, then dump histogram bins 1..100
for i in `seq 1 34`
do
  FastK -T${i} -t1 -p -Nanonymous2 anonymous-reads2
  Histex -h1:100 anonymous2.hist > ${i}.txt
done

Are the histograms different?

for i in `seq 1 34`
do
  diff 1.txt ${i}.txt
done

no output
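The diff loop above can also be written so that it prints an explicit summary rather than relying on silence. This is only a sketch: `compare_to_baseline` is a hypothetical helper name, and it assumes the per-thread dumps are named 1.txt ... 34.txt as in the loop above.

```shell
# Hypothetical helper: report which files differ from a baseline file.
# Usage: compare_to_baseline BASELINE FILE...
compare_to_baseline() {
  baseline=$1
  shift
  differing=0
  for f in "$@"; do
    # cmp -s is silent and returns non-zero when the files differ
    if ! cmp -s "$baseline" "$f"; then
      echo "DIFFERS: $f"
      differing=$((differing + 1))
    fi
  done
  echo "$differing of $# files differ from $baseline"
}
```

For example, `compare_to_baseline 1.txt $(seq 2 34 | sed 's/$/.txt/')` would end with a "0 of 33 files differ" line if every histogram matches the single-thread one.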

OK, when I run MerquryFK with 32 threads (two fewer than the 34 threads used for FastK), I get the "correct" QV estimate. With -T34 or -T33 the QV is still low.

MerquryFK -f -pdf -T34 -P./ anonymous3 anonymous-reference anonymous3

MerquryFK -f -pdf -T33 -P./ anonymous3 anonymous-reference anonymous3-1

MerquryFK -f -pdf -T32 -P./ anonymous3 anonymous-reference anonymous3-2

cat anonymous3.qv 

Assembly             No Support  Total    Error %  QV
anonymous-reference  137658      4641612  0.0752   31.2

cat anonymous3-1.qv 

Assembly             No Support  Total    Error %  QV
anonymous-reference  141776      4641612  0.0775   31.1

cat anonymous3-2.qv

Assembly             No Support  Total    Error %  QV
anonymous-reference  901         4641612  0.0005   53.1
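As a sanity check on how the "No Support" count drives the QV, the figures above can be recomputed by hand. The sketch below assumes the usual Merqury-style estimate, error = 1 - (1 - NoSupport/Total)^(1/k) and QV = -10 log10(error), and assumes k = 40 because FastK was run without -k; `qv_from_counts` is a hypothetical helper name, not part of either tool.

```shell
# qv_from_counts NO_SUPPORT TOTAL [K]
# Prints "error% QV" using the assumed Merqury-style estimate:
#   error = 1 - (1 - NO_SUPPORT/TOTAL)^(1/K),  QV = -10 * log10(error)
# K defaults to 40 here (assumption: FastK was run without -k).
qv_from_counts() {
  awk -v m="$1" -v t="$2" -v k="${3:-40}" 'BEGIN {
    err = 1 - exp(log(1 - m / t) / k)   # per-base error rate
    qv  = -10 * log(err) / log(10)      # awk log() is natural log
    printf "%.4f %.1f\n", 100 * err, qv
  }'
}

qv_from_counts 901 4641612      # -T32 run
qv_from_counts 137658 4641612   # -T34 run
qv_from_counts 141776 4641612   # -T33 run
```

Under those assumptions the three calls reproduce the Error % and QV columns reported above, which suggests the QV drop comes entirely from the inflated "No Support" k-mer count rather than from a different Total.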