Closed matt-shenton closed 11 months ago
Sorry for the formatting, don't know what happened there
an update. Now I tried editing the reads with seqkit replace, and I still have the same results. So at least the problem is not my awk script. Reads prepared by seqkit replace look like:
@00000000000000000001 GCCTCTGCCGATTCGGCTTTCGCTGCGCCTCCCCCAATTCTGCCTCCGCTCTCGCTCGCTGGAGTGTCGGCATCCGTGCCAGTCGCATCTGGGGAAGAGGGGAGAAGGAGAATGATAGGAGATAGAGAAAGAGAGGGGATGAGGGGGAGGA + A<A<FJJJJ<<JJJJJJ-FJJJJAJJFJJFJJJJJJJJJJJJJJJFJJJAJAFJ7<JAJ7-7-F--<AAJ<FFAJFJA<7777<<AF<JJJJJJA7A-AFJJ<F-7AFJJJJFFF<AJFJJFAFAAAF-AAAA-AFJJAJJJJJ)AJ<AF) @00000000000000000002 TTTAATCATTTAGTCATAGACTAAATTAAATCATATAATATTTCTAGGATTTCGTTCGATATTTAGCTCCAATTTGAGTGAAACTTGAACCTAAATTCATCTAAATTCATAAGCTTTCCAATGATATAGAATTCACTAATTGATAAAATAT + AA<<AFJJJAJFJJJJJJJJJJJF-JJJJJJJJJJJJJ<AJJJF<A-FJJJJJJJJJJF-<J<<-FJJJJ-J--FAJJJF<<FFJJJ7JFFJFFF-AF-AAJJJJAJJFF--AAA--AFFFJJJFF<J-F7-7F-A-7FJJJAA7FF--A< @00000000000000000003 AAAGCAACTTTCTAAGATTTTTTCTTTTTGAAAACAAAACACACCGCTTAGCAGTTTGGGAAGCCGCACGTGAAAAATGAGGATGTTTTGATATCTTTCTCTTGCTGCGGAAAGCAGCATTAGGCATCAGGAGGACTATAGAACCTCCCCA + A<AF-JJJ<J-FFJJJFJFJJFJJJFJJJJJJJJJJJJFJJJJJFJJJJFAJJJAJJJJJ7FFJAJJJJF-7FFJJJJJJJJJ-AFFJFJJJJ<JAJJJFFJ--FJF<JF-7<FJA-A<A-AJFJFFJ<F<<)F7<AJJAAFJ<FF-)AFF
So maybe there is a requirement for a description field, or something else in the reads?
Sorry for the previous post, I was getting a bit overwhelmed. I'll try to reformulate it when I can state the problem more clearly
best regards Matt
HI there, I have a query or potential bug. It may be my poor understanding of how fastq is read.
I'm building a graph and annotating with coordinates based on scripts in this repository:
Here's the main part of the script:
` metagraph build -v -p 8 -k ${ksize} --disk-swap ./${tempdir} --mem-cap-gb 20 -o ${outdir}/${sample}/${sample} ${readdir}/${sample}${readlength}R1.fastq.gz ${readdir}/${sample}${readlength}_R2.fastq.gz 2> ${outdir}/${sample}/logs/build.log
metagraph annotate -v -p 8 --disk-swap ./${tempdir} -i ${outdir}/${sample}/${sample}.dbg --mem-cap-gb 20 --anno-filename --coordinates -o ${outdir}/${sample}/columns/${sample} --separately ${readdir}/${sample}${readlength}R1.fastq.gz ${readdir}/${sample}${readlength}_R2.fastq.gz 2> ${outdir}/${sample}/logs/annotate.log
find ${outdir}/${sample}/columns/${sample} -name ".column.annodbg" | metagraph transform_anno -v -p $numthreads --disk-swap ./${temp_dir} --coordinates --mem-cap-gb 36 --anno-type row_diff --row-diff-stage 0 -i ${outdir}/${sample}/${sample}.dbg -o ${outdir}/${sample}/rd_columns/out 2> ${outdir}/${sample}/logs/row_diff0.log find ${outdir}/${sample}/columns/${sample} -name ".column.annodbg" | metagraph transform_anno -v -p $numthreads --disk-swap ./${temp_dir} --coordinates --mem-cap-gb 36 --anno-type row_diff --row-diff-stage 1 -i ${outdir}/${sample}/${sample}.dbg -o ${outdir}/${sample}/rd_columns/out 2> ${outdir}/${sample}/logs/row_diff1.log find ${outdir}/${sample}/columns/${sample} -name "*.column.annodbg" | metagraph transform_anno -v -p $numthreads --disk-swap ./${temp_dir} --coordinates --mem-cap-gb 36 --anno-type row_diff --row-diff-stage 2 -i ${outdir}/${sample}/${sample}.dbg -o ${outdir}/${sample}/rd_columns/out 2> ${outdir}/${sample}/logs/row_diff2.log
find ${outdir}/${sample}/rd_columns -name "*.column.annodbg" | metagraph transform_anno -v -p $numthreads --disk-swap ./${temp_dir} --mem-cap-gb 36 --anno-type row_diff_brwt_coord --greedy --fast --subsample 10000000 -i ${outdir}/${sample}/${sample}.dbg -o ${outdir}/${sample}/annotation/annotation 2> ${outdir}/${sample}/logs/rdbrwt.log
` My issue is I tried to edit long fastq headers and make the third line minimal (only "+") to save disk space and for consistent header length. I used awk as:
fastp -i ${readdir}/${sample}_R1.fastq.gz --stdout -b ${readlength} -l ${readlength} --disable_quality_filtering --disable_trim_poly_g --dont_eval_duplication --disable_adapter_trimming | awk '{if(NR%4==1) {printf "@%020d 1 Hime\n", ++i} else if (NR%4==3) {print "+"} else {print}}' | bgzip > ${readdir}/${sample}_${readlength}_${suffix}_R1.fastq.gz
with fastp to trim the reads for consistent length.Reads only trimmed for length:
@DRR240855.1 ST-E00104:1038:HVJ7MCCXY:1:1101:11464:1766 length=151 GCCTCTGCCGATTCGGCTTTCGCTGCGCCTCCCCCAATTCTGCCTCCGCTCTCGCTCGCTGGAGTGTCGGCATCCGTGCCAGTCGCATCTGGGGAAGAGGGGAGAAGGAGAATGATAGGAGATAGAGAAAGAGAGGGGATGAGGGGGAGGA +DRR240855.1 ST-E00104:1038:HVJ7MCCXY:1:1101:11464:1766 length=151 A<A<FJJJJ<<JJJJJJ-FJJJJAJJFJJFJJJJJJJJJJJJJJJFJJJAJAFJ7<JAJ7-7-F--<AAJ<FFAJFJA<7777<<AF<JJJJJJA7A-AFJJ<F-7AFJJJJFFF<AJFJJFAFAAAF-AAAA-AFJJAJJJJJ)AJ<AF) @DRR240855.2 ST-E00104:1038:HVJ7MCCXY:1:1101:12053:1766 length=151 TTTAATCATTTAGTCATAGACTAAATTAAATCATATAATATTTCTAGGATTTCGTTCGATATTTAGCTCCAATTTGAGTGAAACTTGAACCTAAATTCATCTAAATTCATAAGCTTTCCAATGATATAGAATTCACTAATTGATAAAATAT +DRR240855.2 ST-E00104:1038:HVJ7MCCXY:1:1101:12053:1766 length=151 AA<<AFJJJAJFJJJJJJJJJJJF-JJJJJJJJJJJJJ<AJJJF<A-FJJJJJJJJJJF-<J<<-FJJJJ-J--FAJJJF<<FFJJJ7JFFJFFF-AF-AAJJJJAJJFF--AAA--AFFFJJJFF<J-F7-7F-A-7FJJJAA7FF--A< @DRR240855.3 ST-E00104:1038:HVJ7MCCXY:1:1101:12378:1766 length=151 AAAGCAACTTTCTAAGATTTTTTCTTTTTGAAAACAAAACACACCGCTTAGCAGTTTGGGAAGCCGCACGTGAAAAATGAGGATGTTTTGATATCTTTCTCTTGCTGCGGAAAGCAGCATTAGGCATCAGGAGGACTATAGAACCTCCCCA +DRR240855.3 ST-E00104:1038:HVJ7MCCXY:1:1101:12378:1766 length=151 A<AF-JJJ<J-FFJJJFJFJJFJJJFJJJJJJJJJJJJFJJJJJFJJJJFAJJJAJJJJJ7FFJAJJJJF-7FFJJJJJJJJJ-AFFJFJJJJ<JAJJJFFJ--FJF<JF-7<FJA-A<A-AJFJFFJ<F<<)F7<AJJAAFJ<FF-)AFF
Reads edited:
@00000000000000000001 1 Hime GCCTCTGCCGATTCGGCTTTCGCTGCGCCTCCCCCAATTCTGCCTCCGCTCTCGCTCGCTGGAGTGTCGGCATCCGTGCCAGTCGCATCTGGGGAAGAGGGGAGAAGGAGAATGATAGGAGATAGAGAAAGAGAGGGGATGAGGGGGAGGA + A<A<FJJJJ<<JJJJJJ-FJJJJAJJFJJFJJJJJJJJJJJJJJJFJJJAJAFJ7<JAJ7-7-F--<AAJ<FFAJFJA<7777<<AF<JJJJJJA7A-AFJJ<F-7AFJJJJFFF<AJFJJFAFAAAF-AAAA-AFJJAJJJJJ)AJ<AF) @00000000000000000002 1 Hime TTTAATCATTTAGTCATAGACTAAATTAAATCATATAATATTTCTAGGATTTCGTTCGATATTTAGCTCCAATTTGAGTGAAACTTGAACCTAAATTCATCTAAATTCATAAGCTTTCCAATGATATAGAATTCACTAATTGATAAAATAT + AA<<AFJJJAJFJJJJJJJJJJJF-JJJJJJJJJJJJJ<AJJJF<A-FJJJJJJJJJJF-<J<<-FJJJJ-J--FAJJJF<<FFJJJ7JFFJFFF-AF-AAJJJJAJJFF--AAA--AFFFJJJFF<J-F7-7F-A-7FJJJAA7FF--A< @00000000000000000003 1 Hime AAAGCAACTTTCTAAGATTTTTTCTTTTTGAAAACAAAACACACCGCTTAGCAGTTTGGGAAGCCGCACGTGAAAAATGAGGATGTTTTGATATCTTTCTCTTGCTGCGGAAAGCAGCATTAGGCATCAGGAGGACTATAGAACCTCCCCA + A<AF-JJJ<J-FFJJJFJFJJFJJJFJJJJJJJJJJJJFJJJJJFJJJJFAJJJAJJJJJ7FFJAJJJJF-7FFJJJJJJJJJ-AFFJFJJJJ<JAJJJFFJ--FJF<JF-7<FJA-A<A-AJFJFFJ<F<<)F7<AJJAAFJ<FF-)AFF
Next I query the database: ` metagraph query --query-coords -i Himenomochi_metagraph_k31_l151_noline/Himenomochi/Himenomochi.dbg -a Himenomochi_metagraph_k31_l151_noline/Himenomochi/annotation/annotation.row_diff_brwt_coord.annodbg OsjWaxyExon1.fa | awk -F"\t" '{gsub(/:/,"\n"); print}' >tnoline
metagraph query --query-coords -i Himenomochi_metagraph_k31_l151_line1plus3/Himenomochi/Himenomochi.dbg -a Himenomochi_metagraph_k31_l151_line1plus3/Himenomochi/annotation/annotation.row_diff_brwt_coord.annodbg OsjWaxyExon1.fa | awk -F"\t" '{gsub(/:/,"\n"); print}' >t1plus3 `
` diff tnoline t1plus3 1c1 < 0 OsativaWaxyExon1 <JRCreads/Himenomochi_151_noline_R1.fastq.gz>
`
So the indexes are different (but only for read1). However, when I tried only editing the third line for each read, the differences were in read2.
I'm confused! and probably have some slip in my code. Can you give me some advice? How are the fastq files read, and what are the header requirements? Many thanks for your help!