ratschlab / metagraph

Scalable annotated de Bruijn graphs for DNA indexing, alignment, and assembly
http://metagraph.ethz.ch
GNU General Public License v3.0
110 stars 17 forks source link

editing fastq header lines changes coordinate indexing #476

Closed matt-shenton closed 11 months ago

matt-shenton commented 11 months ago

HI there, I have a query or potential bug. It may be my poor understanding of how fastq is read.

I'm building a graph and annotating with coordinates based on scripts in this repository:

Here's the main part of the script:

` metagraph build -v -p 8 -k ${ksize} --disk-swap ./${tempdir} --mem-cap-gb 20 -o ${outdir}/${sample}/${sample} ${readdir}/${sample}${readlength}R1.fastq.gz ${readdir}/${sample}${readlength}_R2.fastq.gz 2> ${outdir}/${sample}/logs/build.log

metagraph annotate -v -p 8 --disk-swap ./${tempdir} -i ${outdir}/${sample}/${sample}.dbg --mem-cap-gb 20 --anno-filename --coordinates -o ${outdir}/${sample}/columns/${sample} --separately ${readdir}/${sample}${readlength}R1.fastq.gz ${readdir}/${sample}${readlength}_R2.fastq.gz 2> ${outdir}/${sample}/logs/annotate.log

find ${outdir}/${sample}/columns/${sample} -name ".column.annodbg" | metagraph transform_anno -v -p $numthreads --disk-swap ./${temp_dir} --coordinates --mem-cap-gb 36 --anno-type row_diff --row-diff-stage 0 -i ${outdir}/${sample}/${sample}.dbg -o ${outdir}/${sample}/rd_columns/out 2> ${outdir}/${sample}/logs/row_diff0.log find ${outdir}/${sample}/columns/${sample} -name ".column.annodbg" | metagraph transform_anno -v -p $numthreads --disk-swap ./${temp_dir} --coordinates --mem-cap-gb 36 --anno-type row_diff --row-diff-stage 1 -i ${outdir}/${sample}/${sample}.dbg -o ${outdir}/${sample}/rd_columns/out 2> ${outdir}/${sample}/logs/row_diff1.log find ${outdir}/${sample}/columns/${sample} -name "*.column.annodbg" | metagraph transform_anno -v -p $numthreads --disk-swap ./${temp_dir} --coordinates --mem-cap-gb 36 --anno-type row_diff --row-diff-stage 2 -i ${outdir}/${sample}/${sample}.dbg -o ${outdir}/${sample}/rd_columns/out 2> ${outdir}/${sample}/logs/row_diff2.log

find ${outdir}/${sample}/rd_columns -name "*.column.annodbg" | metagraph transform_anno -v -p $numthreads --disk-swap ./${temp_dir} --mem-cap-gb 36 --anno-type row_diff_brwt_coord --greedy --fast --subsample 10000000 -i ${outdir}/${sample}/${sample}.dbg -o ${outdir}/${sample}/annotation/annotation 2> ${outdir}/${sample}/logs/rdbrwt.log

` My issue is I tried to edit long fastq headers and make the third line minimal (only "+") to save disk space and for consistent header length. I used awk as:

fastp -i ${readdir}/${sample}_R1.fastq.gz --stdout -b ${readlength} -l ${readlength} --disable_quality_filtering --disable_trim_poly_g --dont_eval_duplication --disable_adapter_trimming | awk '{if(NR%4==1) {printf "@%020d 1 Hime\n", ++i} else if (NR%4==3) {print "+"} else {print}}' | bgzip > ${readdir}/${sample}_${readlength}_${suffix}_R1.fastq.gz with fastp to trim the reads for consistent length.

Reads only trimmed for length: @DRR240855.1 ST-E00104:1038:HVJ7MCCXY:1:1101:11464:1766 length=151 GCCTCTGCCGATTCGGCTTTCGCTGCGCCTCCCCCAATTCTGCCTCCGCTCTCGCTCGCTGGAGTGTCGGCATCCGTGCCAGTCGCATCTGGGGAAGAGGGGAGAAGGAGAATGATAGGAGATAGAGAAAGAGAGGGGATGAGGGGGAGGA +DRR240855.1 ST-E00104:1038:HVJ7MCCXY:1:1101:11464:1766 length=151 A<A<FJJJJ<<JJJJJJ-FJJJJAJJFJJFJJJJJJJJJJJJJJJFJJJAJAFJ7<JAJ7-7-F--<AAJ<FFAJFJA<7777<<AF<JJJJJJA7A-AFJJ<F-7AFJJJJFFF<AJFJJFAFAAAF-AAAA-AFJJAJJJJJ)AJ<AF) @DRR240855.2 ST-E00104:1038:HVJ7MCCXY:1:1101:12053:1766 length=151 TTTAATCATTTAGTCATAGACTAAATTAAATCATATAATATTTCTAGGATTTCGTTCGATATTTAGCTCCAATTTGAGTGAAACTTGAACCTAAATTCATCTAAATTCATAAGCTTTCCAATGATATAGAATTCACTAATTGATAAAATAT +DRR240855.2 ST-E00104:1038:HVJ7MCCXY:1:1101:12053:1766 length=151 AA<<AFJJJAJFJJJJJJJJJJJF-JJJJJJJJJJJJJ<AJJJF<A-FJJJJJJJJJJF-<J<<-FJJJJ-J--FAJJJF<<FFJJJ7JFFJFFF-AF-AAJJJJAJJFF--AAA--AFFFJJJFF<J-F7-7F-A-7FJJJAA7FF--A< @DRR240855.3 ST-E00104:1038:HVJ7MCCXY:1:1101:12378:1766 length=151 AAAGCAACTTTCTAAGATTTTTTCTTTTTGAAAACAAAACACACCGCTTAGCAGTTTGGGAAGCCGCACGTGAAAAATGAGGATGTTTTGATATCTTTCTCTTGCTGCGGAAAGCAGCATTAGGCATCAGGAGGACTATAGAACCTCCCCA +DRR240855.3 ST-E00104:1038:HVJ7MCCXY:1:1101:12378:1766 length=151 A<AF-JJJ<J-FFJJJFJFJJFJJJFJJJJJJJJJJJJFJJJJJFJJJJFAJJJAJJJJJ7FFJAJJJJF-7FFJJJJJJJJJ-AFFJFJJJJ<JAJJJFFJ--FJF<JF-7<FJA-A<A-AJFJFFJ<F<<)F7<AJJAAFJ<FF-)AFF

Reads edited: @00000000000000000001 1 Hime GCCTCTGCCGATTCGGCTTTCGCTGCGCCTCCCCCAATTCTGCCTCCGCTCTCGCTCGCTGGAGTGTCGGCATCCGTGCCAGTCGCATCTGGGGAAGAGGGGAGAAGGAGAATGATAGGAGATAGAGAAAGAGAGGGGATGAGGGGGAGGA + A<A<FJJJJ<<JJJJJJ-FJJJJAJJFJJFJJJJJJJJJJJJJJJFJJJAJAFJ7<JAJ7-7-F--<AAJ<FFAJFJA<7777<<AF<JJJJJJA7A-AFJJ<F-7AFJJJJFFF<AJFJJFAFAAAF-AAAA-AFJJAJJJJJ)AJ<AF) @00000000000000000002 1 Hime TTTAATCATTTAGTCATAGACTAAATTAAATCATATAATATTTCTAGGATTTCGTTCGATATTTAGCTCCAATTTGAGTGAAACTTGAACCTAAATTCATCTAAATTCATAAGCTTTCCAATGATATAGAATTCACTAATTGATAAAATAT + AA<<AFJJJAJFJJJJJJJJJJJF-JJJJJJJJJJJJJ<AJJJF<A-FJJJJJJJJJJF-<J<<-FJJJJ-J--FAJJJF<<FFJJJ7JFFJFFF-AF-AAJJJJAJJFF--AAA--AFFFJJJFF<J-F7-7F-A-7FJJJAA7FF--A< @00000000000000000003 1 Hime AAAGCAACTTTCTAAGATTTTTTCTTTTTGAAAACAAAACACACCGCTTAGCAGTTTGGGAAGCCGCACGTGAAAAATGAGGATGTTTTGATATCTTTCTCTTGCTGCGGAAAGCAGCATTAGGCATCAGGAGGACTATAGAACCTCCCCA + A<AF-JJJ<J-FFJJJFJFJJFJJJFJJJJJJJJJJJJFJJJJJFJJJJFAJJJAJJJJJ7FFJAJJJJF-7FFJJJJJJJJJ-AFFJFJJJJ<JAJJJFFJ--FJF<JF-7<FJA-A<A-AJFJFFJ<F<<)F7<AJJAAFJ<FF-)AFF

Next I query the database: ` metagraph query --query-coords -i Himenomochi_metagraph_k31_l151_noline/Himenomochi/Himenomochi.dbg -a Himenomochi_metagraph_k31_l151_noline/Himenomochi/annotation/annotation.row_diff_brwt_coord.annodbg OsjWaxyExon1.fa | awk -F"\t" '{gsub(/:/,"\n"); print}' >tnoline

metagraph query --query-coords -i Himenomochi_metagraph_k31_l151_line1plus3/Himenomochi/Himenomochi.dbg -a Himenomochi_metagraph_k31_l151_line1plus3/Himenomochi/annotation/annotation.row_diff_brwt_coord.annodbg OsjWaxyExon1.fa | awk -F"\t" '{gsub(/:/,"\n"); print}' >t1plus3 `

` diff tnoline t1plus3 1c1 < 0 OsativaWaxyExon1 <JRCreads/Himenomochi_151_noline_R1.fastq.gz>

0 OsativaWaxyExon1 <JRCreads/Himenomochi_151_line1plus3_R1.fastq.gz> 3c3 < 0-1192729028-1192729064

0-1192850028-1192850064 5,11c5,11 < 0-3323309369-3323309406 < 0-3323363214-3323363251 < 0-136935538-136935578 < 0-1351547923-1351547974 < 0-2796603452-2796603541 < 0-596094181-596094278 < 5-4234737672-4234737792

0-3323188369-3323188406 0-3323242214-3323242251 0-136814538-136814578 0-1351426923-1351426974 0-2796482452-2796482541 0-595973181-595973278 5-4234858672-4234858792 13,16c13,16 < 18-1407423358-1407423478 < 26-2473849235-2473849355 < 37-1517694046-1517694166 < 78-4419861139-4419861237

18-1407302358-1407302478 26-2473728235-2473728355 37-1517815046-1517815166 78-4419982139-4419982237 19c19 < 99-3973328909-3973329029

99-3973449909-3973450029 22,25c22,25 < 134-1722755045-1722755165 < 149-425347912-425348032 < 149-425484642-425484762 < 236-3917505196-3917505246

134-1722876045-1722876165 149-425363642-425363762 149-425468912-425469032 236-3917384196-3917384246 28c28 < 280-1268088107-1268088135 <JRCreads/Himenomochi_151_noline_R2.fastq.gz>

280-1268088107-1268088135 <JRCreads/Himenomochi_151_line1plus3_R2.fastq.gz>

`

So the indexes are different (but only for read1). However, when I tried only editing the third line for each read, the differences were in read2.

I'm confused! and probably have some slip in my code. Can you give me some advice? How are the fastq files read, and what are the header requirements? Many thanks for your help!

matt-shenton commented 11 months ago

Sorry for the formatting, don't know what happened there

matt-shenton commented 11 months ago

an update. Now I tried editing the reads with seqkit replace, and I still have the same results. So at least the problem is not my awk script. Reads prepared by seqkit replace look like:

@00000000000000000001 GCCTCTGCCGATTCGGCTTTCGCTGCGCCTCCCCCAATTCTGCCTCCGCTCTCGCTCGCTGGAGTGTCGGCATCCGTGCCAGTCGCATCTGGGGAAGAGGGGAGAAGGAGAATGATAGGAGATAGAGAAAGAGAGGGGATGAGGGGGAGGA + A<A<FJJJJ<<JJJJJJ-FJJJJAJJFJJFJJJJJJJJJJJJJJJFJJJAJAFJ7<JAJ7-7-F--<AAJ<FFAJFJA<7777<<AF<JJJJJJA7A-AFJJ<F-7AFJJJJFFF<AJFJJFAFAAAF-AAAA-AFJJAJJJJJ)AJ<AF) @00000000000000000002 TTTAATCATTTAGTCATAGACTAAATTAAATCATATAATATTTCTAGGATTTCGTTCGATATTTAGCTCCAATTTGAGTGAAACTTGAACCTAAATTCATCTAAATTCATAAGCTTTCCAATGATATAGAATTCACTAATTGATAAAATAT + AA<<AFJJJAJFJJJJJJJJJJJF-JJJJJJJJJJJJJ<AJJJF<A-FJJJJJJJJJJF-<J<<-FJJJJ-J--FAJJJF<<FFJJJ7JFFJFFF-AF-AAJJJJAJJFF--AAA--AFFFJJJFF<J-F7-7F-A-7FJJJAA7FF--A< @00000000000000000003 AAAGCAACTTTCTAAGATTTTTTCTTTTTGAAAACAAAACACACCGCTTAGCAGTTTGGGAAGCCGCACGTGAAAAATGAGGATGTTTTGATATCTTTCTCTTGCTGCGGAAAGCAGCATTAGGCATCAGGAGGACTATAGAACCTCCCCA + A<AF-JJJ<J-FFJJJFJFJJFJJJFJJJJJJJJJJJJFJJJJJFJJJJFAJJJAJJJJJ7FFJAJJJJF-7FFJJJJJJJJJ-AFFJFJJJJ<JAJJJFFJ--FJF<JF-7<FJA-A<A-AJFJFFJ<F<<)F7<AJJAAFJ<FF-)AFF

So maybe there is a requirement for a description field, or something else in the reads?

matt-shenton commented 11 months ago

Sorry for the previous post, I was getting a bit overwhelmed. I'll try to reformulate it when I can state the problem more clearly

best regards Matt