tecangenomics / nudup

NuDup -- Marks/removes duplicate molecules based on the molecular tagging technology used in Tecan products.
http://www.tecangenomics.com
GNU Lesser General Public License v3.0
14 stars 9 forks source link

sort: ... disorder: ... Error #19

Closed boro2013 closed 5 years ago

boro2013 commented 5 years ago

Hi all,

I got the following error when trying to remove PCR duplicates using nudup.py script:

2019-01-14 14:19:24,773 [ INFO] - Deduplicating NuGEN paired end reads... 2019-01-14 14:19:24,998 [ INFO] - Using molecular tag sequence from Index FASTQ read 2019-01-14 14:19:24,999 [ INFO] - Appending molecular tag sequence to SAM/BAM read name sort: /tmp/nudup_6SZRo0_sorted.fq:2: disorder: NB551431:38:HYFFGBGX7:1:11101:10000:6104 CACCAGCC 2019-01-14 14:19:33,109 [ ERROR] -

I've paired-end reads files, and a separate fastq file containing the corresponding UMI indexes. After trimming, I got reads of different lengths. I've performed the alignment using STAR, and extracted uniquely mapped reads from the bam file, and applied the command of nudup for PE reads as follows:

python nudup.py --paired-end -f umi.fq -o deduplicated uniquely_reads.bam -s 8 -l 8

Any Idea ?

mlovci commented 5 years ago

Is the alignment file you're trying to de-duplicate sorted already? Can you send the output of head umi.fq and samtools view -H uniquely_reads.bam | tail and samtools view uniquely_reads.bam | head if the .bam is sorted?

boro2013 commented 5 years ago

Thank you for your reply, The alignment file I tried to de-duplicate is an unsorted bam file, but I got the same error with a sorted (by coordinates/name) bam file. head umi.fq @NB551431:38:HYFFGBGX7:1:11101:17599:1058 2:N:0:CGTCTAAC CTCCAAGA + EEEEEEEE @NB551431:38:HYFFGBGX7:1:11101:23194:1058 2:N:0:CGTCTAAC CAAATACG + EEEEAEAE @NB551431:38:HYFFGBGX7:1:11101:19361:1058 2:N:0:GGTCTAAC AGGCGAAT

and, samtools view -H uniquely_reads.bam | tail # for unsorted bam file @SQ SN:KI270419.1 LN:1029 @SQ SN:KI270336.1 LN:1026 @SQ SN:KI270312.1 LN:998 @SQ SN:KI270539.1 LN:993 @SQ SN:KI270385.1 LN:990 @SQ SN:KI270423.1 LN:981 @SQ SN:KI270392.1 LN:971 @SQ SN:KI270394.1 LN:970 @PG ID:STAR PN:STAR VN:STAR_2.5.4b CL:STAR --runThreadN 20 --genomeDir /home/uhg/gLatge/RNAseq_Guillaume/starIndexes/idx100 --readFilesIn /home/uhg/gLatge/scripts/H266_Test/H266.cleanedR1.fq /home/uhg/gLatge/scripts/H266_Test/H266.cleanedR2.fq --outFileNamePrefix H266_Test --outSAMtype BAM Unsorted --outSAMmapqUnique 255 --outSAMattrIHstart 0 --outFilterMultimapNmax 10 --outFilterScoreMinOverLread 0.4 --outFilterMatchNmin 15 --outFilterMatchNminOverLread 0.4 --outFilterMismatchNmax 12 --outFilterIntronMotifs RemoveNoncanonical --seedSearchStartLmax 20 --alignMatesGapMax 1000 --chimSegmentMin 20 --sjdbGTFfile /home/uhg/gLatge/RNAseq_Guillaume/starIndexes/Homo_sapiens.GRCh38.94.gtf @CO user command line: STAR --genomeDir /home/uhg/gLatge/RNAseq_Guillaume/starIndexes/idx100 --sjdbGTFfile /home/uhg/gLatge/RNAseq_Guillaume/starIndexes/Homo_sapiens.GRCh38.94.gtf --outFileNamePrefix H266_Test --seedSearchStartLmax 20 --outFilterMultimapNmax 10 --outFilterMismatchNmax 12 --alignMatesGapMax 1000 --outSAMattrIHstart 0 --outSAMmapqUnique 255 --outFilterIntronMotifs RemoveNoncanonical --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --outFilterMatchNmin 15 --chimSegmentMin 20 --runThreadN 20 --outSAMtype BAM Unsorted --readFilesIn /home/uhg/gLatge/scripts/H266_Test/H266.cleanedR1.fq /home/uhg/gLatge/scripts/H266_Test/H266.cleanedR2.fq

and, samtools view uniquely_reads.sorted.bam | head # for by coordinates sorted bam file NB551431:38:HYFFGBGX7:1:11101:24250:2644 163 1 23470 255 84M169628N31M = 193996 170616 AGGCTGCAAAGTGAAGGAGCAGGGGCTCCAGGTCTGGCGACAACCAGGGAAGGGACAGGGCAGGGATGGCTTGGACCACGAGAGGCATCTGGCCCTCCCTGCGCTGTGCCAGCAG /AAAAEEEEAAEEE<EE<EEEEEEEEEEE/EEEEEEEEE/E//EE/EEEEEEEE/EEEEEE6EEEEAEEEAEEE/EE/AE/EEEEE6/EEEEEEEEEEEEEEEEEEAEEEEEEEE NH:i:1 HI:i:0 AS:i:193 nM:i:2 NB551431:38:HYFFGBGX7:1:11101:1533:7968 163 1 138353 255 151M = 138658 446 TCACCTGAGGATGCCACAGTGAGACACCATCTGGGTCTGGAGGGTCCACTGTGAGGCAGAGGCTGACCTGTAGAGTCCGACAGTAGACAGAAGTTGGGCAAAAGCCTGATTTGAGGAAGTTTTGGGCTTCAAGAGTCAGCCACGAGGCAGG AAAAAEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEE<EEEEEEEEEEEEEEEEEEEEEAAEAEEEEEEEEEEEEAEAAEA<A/AAEE<<AEE</AAA6AEA NH:i:1 HI:i:0 AS:i:288 nM:i:1 NB551431:38:HYFFGBGX7:1:11101:1533:7968 83 1 138658 255 141M = 138353 -446 GGGAGGCAAGAGCAGGGCCTGCAGAGGCTGTTCTCAAGTCAAAGCTGGGCCTGTTGATGCCACCGGGAAGCAGAAGGTGGGCCTGGAGAGTTAGACTTGAGGAAGTTTTGGGCCTACATTGGCCGCCATTAGCTGGACAGG AAAEA<EE66/A</<<AA<<EEA<<A6</AA/EEE</<//</</EEEEAEEEE/EEEEE/EAEEE6AE6/EEAAEE/AE/AEE/E/E<EEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEE NH:i:1 HI:i:0 AS:i:288 nM:i:1 NB551431:38:HYFFGBGX7:1:11101:24250:2644 83 1 193996 255 1S90M19S = 23470 -170616 AGGCTGCAAAGTGAAGGAGCAGGGGCTCCAGGTCTGGCGACAACCAGGGAAGGGACAGGGCAGGGATGGCTTGGACCACGAGAGGCATCTGGCCCTCCCTGCGCTGTGCC EAAE/EEEEAE/EE/EE/EEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEE//EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NH:i:1 HI:i:0 AS:i:193 nM:i:2 NB551431:38:HYFFGBGX7:1:11101:6213:1993 99 1 629081 255 35S105M2S = 629081 105 AGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCCA EEEEEEEEEEEAEEEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEAEEEAAAA<EAEEAEE<EEEEEE<A<AAAEE NH:i:1 HI:i:0 AS:i:204 nM:i:2 NB551431:38:HYFFGBGX7:1:11101:21213:1918 99 1 629081 255 21S70M = 629081 70 GAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCAT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAEEEEEEEEEE NH:i:1 HI:i:0 AS:i:134 nM:i:2 NB551431:38:HYFFGBGX7:1:11101:17553:2488 99 1 629081 255 43S97M = 629081 97 TCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAAT EEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAEEEEEEEEEEEEE6EE<EAAE NH:i:1 HI:i:0 AS:i:188 nM:i:2 NB551431:38:HYFFGBGX7:1:11101:25854:3854 99 1 629081 255 40S88M = 629081 89 ACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACAT EEEEEEEEEEEEAEEEEEAE/EAEEAE<EAEEEE/EEE<EEE/E/EEEEEEEEEEEEE<EAEEEAEEEEEEEE/AAEAAEEEEEAEAAEEEEE/AAAEEEEE/EEA<EA<AA<AAAAAAAAAAAEAAE NH:i:1 HI:i:0 AS:i:171 nM:i:2 NB551431:38:HYFFGBGX7:1:11101:7222:6890 99 1 629081 255 32S44M = 629081 44 AGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATCCGCC EEEE6EEEEEAEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE NH:i:1 HI:i:0 AS:i:79 nM:i:4 NB551431:38:HYFFGBGX7:1:11101:2206:7413 99 1 629081 255 55S78M = 629081 78 ATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAAT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<AAA6A NH:i:1 HI:i:0 AS:i:150 nM:i:2

mlovci commented 5 years ago

Your sorted .bam file should work, according to what you are showing. I suspect perhaps the filtering you are doing to select only unique alignments isn't working correctly. Can you attempt an alignment with the STAR parameter: --outFilterMultimapNmax 1

boro2013 commented 5 years ago

I've applied the following command to extract uniquely mapped reads from the bam file generated by STAR: samtools view -q 255 -o uniquely_reads.bam all_aligned_reads.bam And to check if that is correct, I've counted uniquely mapped reads in uniquely_reads.bam file using the following formula:

SE: samtools view -c -q 255 -F 0x2 uniquely_reads.bam

PE: samtools view -c -q 255 -f 0x2 uniquely_reads.bam

Number of uniquely reads = (PE/2) + SE (since I've paired-end reads)

and this number corresponds perfectly to what I got in the log file of STAR

I've also attempt the alignment using --outFilterMultimapNmax 1 but I got the same error: 2019-01-24 16:06:01,368 [ INFO] - Deduplicating NuGEN paired end reads... 2019-01-24 16:06:01,608 [ INFO] - Using molecular tag sequence from Index FASTQ read 2019-01-24 16:06:01,609 [ INFO] - Appending molecular tag sequence to SAM/BAM read name sort: /tmp/nudup_w0eoRd_sorted.fq:2: disorder: NB551431:38:HYFFGBGX7:1:11101:10000:6104 CACCAGCC 2019-01-24 16:06:09,705 [ ERROR] -

I assume you've already tested nudup on some bam file, can you tell me please which aligner have you used to get the alignment file, and the command line you've used to extract uniquely mapped reads ?

shuelga commented 5 years ago

We use STAR with the --outFilterMultimapNmax 1 flag and then do not need to do any filtering.

The sort error you are seeing is actually from the .fq file when it is trying to match umi with aligned reads. Is NB551431:38:HYFFGBGX7:1:11101:10000:6104 CACCAGCC in your umi.fq AND your BAM file?

boro2013 commented 5 years ago

I've used STAR with --outFilterMultimapNmax 1 and without filtering, but still I got the same error.

the read NB551431:38:HYFFGBGX7:1:11101:10000:6104 CACCAGCC is present in both the umi.fq and in the bam file: grep -nr NB551431:38:HYFFGBGX7:1:11101:10000:6104 umi.fq 272777:@NB551431:38:HYFFGBGX7:1:11101:10000:6104 2:N:0:CGTCTAAC 672777:@NB551431:38:HYFFGBGX7:1:11101:10000:6104 2:N:0:CGTCTAAC As you can see the read is present in umi.fq file. Now I check if UMI=CACCAGCC in the same file NUM=272778 sed "${NUM}q;d" umi.fq CACCAGCC and for the second: NUM=672778 sed "${NUM}q;d" umi.fq CACCAGCC As you can see the UMI is present.

What I've tried also, is to use another version of samtools for sorting, I've used samtools 1.2.0, but I got the same error.

Another attempt is perform the mapping using bwa aligner as follows: bwa mem -M -c 1 -t 12 $hg_ref_path $READR1 $READR2 > $output.sam samtools view -S -b $output.sam > $output.bam samtools sort -o $output.sorted..bam $output.bam

But when I called nudup, I got the same error by on another read, the error is: 2019-01-25 12:41:09,320 [ INFO] - Deduplicating NuGEN paired end reads... 2019-01-25 12:41:09,666 [ INFO] - Using molecular tag sequence from Index FASTQ read 2019-01-25 12:41:09,667 [ INFO] - Appending molecular tag sequence to SAM/BAM read name sort: /tmp/nudup_C4FnOL_sorted.fq:2: disorder: NB551431:38:HYFFGBGX7:1:11101:10000:6938 CCTTGGCA 2019-01-25 12:41:17,785 [ ERROR] -

boro2013 commented 5 years ago

The error comes when trying to sort umi.fq I've fixed the issue, but still I got another error: pipe broken. For this kind of error, the problem concerns the tmp directory. Using the -T $TMPDIR option to redirect the tmp directory to $TMPDIR is not enough. The issue can be fixed by adding the following line to ~/.bash_profile file: export TMPDIR="somewhere_you_redirect_the_tmp_directory"