nanopore-wgs-consortium / NA12878

Data and analysis for NA12878 genome on nanopore

Duplicate reads in Native RNA fastq file #93

Closed tleonardi closed 3 years ago

tleonardi commented 4 years ago

Hi, I've noticed that the fastq file for direct RNA-Seq basecalled with Guppy 3.2.6 contains 94143 duplicate reads. See below for details. This could be related to #88.

> awk '(NR+3)%4==0' NA12878-DirectRNA_All_Guppy_3.2.6.fastq | wc -l
13361612

> awk '(NR+3)%4==0' NA12878-DirectRNA_All_Guppy_3.2.6.fastq | sort -u | wc -l
13267469

> awk '(NR+3)%4==0' NA12878-DirectRNA_All_Guppy_3.2.6.fastq | sort | uniq -d | head -2
@000057d9-bb59-4c31-b3fb-b1b8a21b88ed runid=2e6b3641a1ae6974d98c3c8d6b118938ee9837a1 sampleid=Bham_Run2_20171011_directRNA read=953 ch=498 start_time=2017-10-11T18:07:41Z
@000068c2-1150-4bc8-a820-6d5986816b0a runid=2e6b3641a1ae6974d98c3c8d6b118938ee9837a1 sampleid=Bham_Run2_20171011_directRNA read=2979 ch=245 start_time=2017-10-11T18:56:58Z

> grep -A3 000057d9-bb59-4c31-b3fb-b1b8a21b88ed NA12878-DirectRNA_All_Guppy_3.2.6.fastq
@000057d9-bb59-4c31-b3fb-b1b8a21b88ed runid=2e6b3641a1ae6974d98c3c8d6b118938ee9837a1 sampleid=Bham_Run2_20171011_directRNA read=953 ch=498 start_time=2017-10-11T18:07:41Z
GAGCAUGGCCUGCGCUGCGCCACGAUGUCCGGGGAGGGAGUCAGCCAGGAGCUUGGGAAAGGGACUGCGACGCGCCCCCAGGGCCGGUCCCGGAGGGCUGAUCCGCUGCAUACAGCAUGUGAGGUUUGCCCGUUUUUGCUGAGAGGACGCGUCAUAGUCCGCUGAAGGCCUGGGAAUCAGGCAUGAAGUCAUCAAUAUCAACCUGAAAAAUAAGCCGAGUGGUUUAAGAAAUCCUCUGGUCUGGUGCGCAGUUUUGGAAACAGUCGGGUCAGCUGAUCUACAAGCUUGCCAUCACCUGUUGAGUACCUGGAUGAUACCCAGGGAAGAAGCUGUUUGCCGGAUGACCCCUAUGAGAAAAGCUGCCAGUAGAUGAUUUUAGAGUUGUUAAGGUGCCAUCCUUGGUAAGGAAGCUUUUAUUAGAAGCCAAAAAUAAAGAAGACUAUGCUGGCCUAAAAGAAGAAUUCGUAAAAGAAUACCAAGCUGAGGAGGUUUGACUAAUAAGAAGACGACCUUCUUGUGGCAAUCAUAUCUUAUGAUUGAUACUCAUCUGGCCUGGUUUGAUAGCUGGAAGCAAAUGAAGUAUCCAGAGUGUGUAGACCACUCCAAACUGAAUUGGUGGAUGGCAGCCAUCAGAAGGAAGAUCCCACAGUCUCAGCCCUGCUUACUAGUGAGAAAGACUGGCAAGGGUUUCCAAGAGCUCAUUCUUACAGAACAGCCCUGAAGGCCUGACUAUGGGCUACUGAAGGGGCAGGAAGUCAGCAAUAAAUGUCUGAAUAUCCUC
+
-;,'%%//?><,,16;8,%0%(-(,.)&(,-89<3179*-/)(8;H?DO-09-5'9@=9*+01%#"##(#$%.;..0+2#2233#)-,$1$45020$0)5+*,)&$"&,127:9:0$$$+0)+)/&&&2:;8;%&/5743,*5;;AE7>38)&&((.,.1(')-A@H92(%'+-1+)-/86588=:@?86186:<=@A8*&&)145***($$$0*$&(72.31%&/>664>*+$%&.362;=@C<;2-8.6@?4>597,+20;62&/688--'7B><$%$1*-.+%**(&%&''**(+%*8:/,+,/,568/86//5;<.;77+7;A:;/1,+&&(%(0:K:EB?>7169=B>9B<<><..*;822($'GDE@:,$%5<A/-:;&-$$6<<IC<45*0$973?4*2*7?D756/0-6=9=BLCA?7233/37615;<:999710%18:<=;89=>=2280%*(&2*/114;9)2,)0+*1.35%/G8<5*%69A<;4<.2<4AB>G.2=;&'.(*3'$1.7**''#%#&#$$'-12779>5..&*&&&)00=9--*0,%0)$#%)*277-,+-$#$2"&+($$%(&('51+(,$#2+'$#$$',(('-/41((*)(63;.()(.3'540%##66.1*+1:913//-;393%'52324.34+('**,,:9/(652C:;/*-)*:8>96867.>#&B:F1204-'$()3:-=*'7156;7+/(,25:.*-)'%3:63+/###'(-++6GA=.78--.)4/;930))(%')*%,3,$#$#$%)$
--
@000057d9-bb59-4c31-b3fb-b1b8a21b88ed runid=2e6b3641a1ae6974d98c3c8d6b118938ee9837a1 sampleid=Bham_Run2_20171011_directRNA read=953 ch=498 start_time=2017-10-11T18:07:41Z
GAGCAUGGCCUGCGCUGCGCCACGAUGUCCGGGGAGGGAGUCAGCCAGGAGCUUGGGAAAGGGACUGCGACGCGCCCCCAGGGCCGGUCCCGGAGGGCUGAUCCGCUGCAUACAGCAUGUGAGGUUUGCCCGUUUUUGCUGAGAGGACGCGUCAUAGUCCGCUGAAGGCCUGGGAAUCAGGCAUGAAGUCAUCAAUAUCAACCUGAAAAAUAAGCCGAGUGGUUUAAGAAAUCCUCUGGUCUGGUGCGCAGUUUUGGAAACAGUCGGGUCAGCUGAUCUACAAGCUUGCCAUCACCUGUUGAGUACCUGGAUGAUACCCAGGGAAGAAGCUGUUUGCCGGAUGACCCCUAUGAGAAAAGCUGCCAGUAGAUGAUUUUAGAGUUGUUAAGGUGCCAUCCUUGGUAAGGAAGCUUUUAUUAGAAGCCAAAAAUAAAGAAGACUAUGCUGGCCUAAAAGAAGAAUUCGUAAAAGAAUACCAAGCUGAGGAGGUUUGACUAAUAAGAAGACGACCUUCUUGUGGCAAUCAUAUCUUAUGAUUGAUACUCAUCUGGCCUGGUUUGAUAGCUGGAAGCAAAUGAAGUAUCCAGAGUGUGUAGACCACUCCAAACUGAAUUGGUGGAUGGCAGCCAUCAGAAGGAAGAUCCCACAGUCUCAGCCCUGCUUACUAGUGAGAAAGACUGGCAAGGGUUUCCAAGAGCUCAUUCUUACAGAACAGCCCUGAAGGCCUGACUAUGGGCUACUGAAGGGGCAGGAAGUCAGCAAUAAAUGUCUGAAUAUCCUC
+
-;,'%%//?><,,16;8,%0%(-(,.)&(,-89<3179*-/)(8;H?DO-09-5'9@=9*+01%#"##(#$%.;..0+2#2233#)-,$1$45020$0)5+*,)&$"&,127:9:0$$$+0)+)/&&&2:;8;%&/5743,*5;;AE7>38)&&((.,.1(')-A@H92(%'+-1+)-/86588=:@?86186:<=@A8*&&)145***($$$0*$&(72.31%&/>664>*+$%&.362;=@C<;2-8.6@?4>597,+20;62&/688--'7B><$%$1*-.+%**(&%&''**(+%*8:/,+,/,568/86//5;<.;77+7;A:;/1,+&&(%(0:K:EB?>7169=B>9B<<><..*;822($'GDE@:,$%5<A/-:;&-$$6<<IC<45*0$973?4*2*7?D756/0-6=9=BLCA?7233/37615;<:999710%18:<=;89=>=2280%*(&2*/114;9)2,)0+*1.35%/G8<5*%69A<;4<.2<4AB>G.2=;&'.(*3'$1.7**''#%#&#$$'-12779>5..&*&&&)00=9--*0,%0)$#%)*277-,+-$#$2"&+($$%(&('51+(,$#2+'$#$$',(('-/41((*)(63;.()(.3'540%##66.1*+1:913//-;393%'52324.34+('**,,:9/(652C:;/*-)*:8>96867.>#&B:F1204-'$()3:-=*'7156;7+/(,25:.*-)'%3:63+/###'(-++6GA=.78--.)4/;930))(%')*%,3,$#$#$%)$
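
Until re-basecalled files are available, the duplicate check above can be turned into a simple filter. The following is a minimal sketch, not the consortium's method: it assumes every record spans exactly four lines (as the `awk '(NR+3)%4==0'` commands above already do), that the read ID is the first whitespace-separated field of the header, and that duplicated records are identical copies, as in the example shown. The function name `dedup_fastq` and the output filename are hypothetical.

```shell
# Keep only the first record seen for each read ID in a FASTQ file.
# Assumption: strict 4-line records (header, sequence, "+", qualities).
dedup_fastq() {
    awk '
        NR % 4 == 1 {
            id = $1               # read ID: first field of the header line
            keep = !(id in seen)  # true only the first time this ID appears
            seen[id] = 1
        }
        keep                      # print all four lines of a kept record
    ' "$1"
}

# Hypothetical usage (output filename is an assumption):
# dedup_fastq NA12878-DirectRNA_All_Guppy_3.2.6.fastq > NA12878-DirectRNA_dedup.fastq
```

Because the filter holds every read ID in memory, a file with roughly 13 million reads may need several gigabytes of RAM. Re-running the `sort | uniq -d` check on the filtered headers should then return nothing.
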

mitenjain commented 4 years ago

Hi @tleonardi,

Sorry for the delay in getting to this. We were able to identify the issue and are repackaging data for re-basecalling this week. We will make both the packed signal and basecalled data available in a week.

FYI we also found an issue with cDNA Guppy basecalls which should get resolved with this round of basecalling.

Sorry for the hassle. -Miten

mitenjain commented 3 years ago

Hi @tleonardi,

Minor update: I am still downloading the data (there are many millions of files, so it is going slowly). As soon as that's done, we'll re-basecall and share the updated files.

-Miten

tleonardi commented 3 years ago

Hi @mitenjain, thanks for the update, no problem at all!

Tom

mitenjain commented 3 years ago

Hi Tom (@tleonardi),

We just (finally) updated the data and fixed the issues in rel2. The new basecalls are with Guppy 4.2.2. Hope this helps.

Sorry about the time it took to fix this. Let me know if you have any questions.

tleonardi commented 3 years ago

Brilliant, thanks Miten!