replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0

nextclade pos is now the nucleotide before the insertion #185

Closed RaverJay closed 2 years ago

RaverJay commented 2 years ago

Quick fix for #184

I assume the pos is now the start of the amino acid before the insertion (e.g. R for R214REPE instead of E). This should give the correct amino acid and codon number now.

Please test, I can't right now =)

hoelzer commented 2 years ago

Awesome, thx for the quick fix! I also can't test right now, but I will by tomorrow evening at the latest ;)

hoelzer commented 2 years ago

Hm @RaverJay, now for the sequences that have insertions, nothing from Nextclade is reported : )

image

RaverJay commented 2 years ago

Argh, what o_o

Can you paste relevant lines from the files from the conversion process? (Original nextclade results and the converted nextclade results)

I suspect it's some small error but otherwise I can look at it tomorrow

hoelzer commented 2 years ago

Ah the two relevant processes (where insertions are in the nextclade output) actually failed

[43/ce94f0] process > determine_mutations_wf:add_aainsertions (23) [100%] 23 of 23, failed: 2 ✔

From the .command.log

 cat raver-work/c1/b95e0d5839bbbf7fe5e84222daf4b7/.command.log
LOG: Started convert_insertions_nt2aa.py ...
Traceback (most recent call last):
  File "raver-porecov/bin/convert_insertions_nt2aa.py", line 583, in <module>
    res_data.at[sample, 'aaInsertionsCustom'] = insertions_nt_to_aa(nt_ins) if type(nt_ins) == str else ''
  File "raver-porecov/bin/convert_insertions_nt2aa.py", line 564, in insertions_nt_to_aa
    aa_ins_list.append(gene + ':' + aa_before + str(codon) + aa_before + aminos)
TypeError: can only concatenate str (not "NoneType") to str
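
The traceback shows a string concatenation in which one operand is `None`. As a minimal sketch of how such a failure could be surfaced with a clearer message, assuming a hypothetical helper `format_aa_insertion` that is not part of the actual script, the same pieces (`gene`, `aa_before`, `codon`, `aminos`) can be checked before they are joined:

```python
def format_aa_insertion(gene, aa_before, codon, aminos):
    """Build a label such as 'S:R214REPE' from its parts.

    Hypothetical helper for illustration only; the real script builds the
    string inline. Raises a descriptive error instead of a bare TypeError
    when a lookup returned None.
    """
    parts = {"gene": gene, "aa_before": aa_before, "aminos": aminos}
    missing = [name for name, value in parts.items() if value is None]
    if missing:
        raise ValueError(
            "cannot format insertion at codon "
            f"{codon}: missing {', '.join(missing)}"
        )
    # Same layout as in the traceback: gene:aa_before + codon + aa_before + aminos
    return f"{gene}:{aa_before}{codon}{aa_before}{aminos}"


print(format_aa_insertion("S", "R", 214, "EPE"))
```

With the values from this thread the call prints `S:R214REPE`; passing `None` for `aa_before` raises a `ValueError` that names the missing part.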

hoelzer commented 2 years ago

The nextclade output and input for your conversion script is:

seqName clade   qc.overallScore qc.overallStatus        totalSubstitutions      totalDeletions  totalInsertions totalFrameShifts        totalAminoacidSubstitutions     totalAminoacidDeletions totalMissing  totalNonACGTNs   totalPcrPrimerChanges   substitutions   deletions       insertions      frameShifts     aaSubstitutions aaDeletions     missing nonACGTNs       pcrPrimerChanges        alignmentScore  alignmentStart alignmentEnd    qc.missingData.missingDataThreshold     qc.missingData.score    qc.missingData.status   qc.missingData.totalMissing     qc.mixedSites.mixedSitesThreshold       qc.mixedSites.score   qc.mixedSites.status     qc.mixedSites.totalMixedSites   qc.privateMutations.cutoff      qc.privateMutations.excess      qc.privateMutations.score       qc.privateMutations.status      qc.privateMutations.total      qc.snpClusters.clusteredSNPs    qc.snpClusters.score    qc.snpClusters.status   qc.snpClusters.totalSNPs        qc.frameShifts.frameShifts      qc.frameShifts.totalFrameShifts qc.frameShifts.frameShiftsIgnored      qc.frameShifts.totalFrameShiftsIgnored  qc.frameShifts.score    qc.frameShifts.status   qc.stopCodons.stopCodons        qc.stopCodons.totalStopCodons   qc.stopCodons.score     qc.stopCodons.status   errors
Samplename    "21K (Omicron)" 10.145405       good    49      39      9       0       41      16      1160    0       5       C241T,A2832G,C3037T,T5386G,C5730T,G8393A,C10029T,C10449A,A11537G,T13195C,C14408T,C15240T,A18163G,C21762T,C21846T,G22578A,T22673C,C22674T,T22679C,C22686T,G22992A,C22995A,A23013C,A23040G,G23048A,A23055G,A23063T,T23075C,C23202A,A23403G,C23525T,T23599G,C23604A,G23948T,C24130A,A24424T,T24469A,C24503T,C25000T,C25584T,C26270T,G26709A,A27259C,C27807T,A28271T,C28311T,G28881A,G28882A,G28883C      6513-6515,11285-11293,21765-21770,21987-21995,22194-22196,28362-28370   22204:GAGCCAGAA       E:T9I,M:A63T,N:P13L,N:R203K,N:G204R,ORF1a:K856R,ORF1a:T1822I,ORF1a:L2084I,ORF1a:A2710T,ORF1a:T3255I,ORF1a:P3395H,ORF1a:I3758V,ORF1b:P314L,ORF1b:I1566V,ORF9b:P10S,S:A67V,S:T95I,S:Y145D,S:L212I,S:G339D,S:S371L,S:S373P,S:S375F,S:S477N,S:T478K,S:E484A,S:Q493R,S:G496S,S:Q498R,S:N501Y,S:Y505H,S:T547K,S:D614G,S:H655Y,S:N679K,S:P681H,S:D796Y,S:N856K,S:Q954H,S:N969K,S:L981F N:E31-,N:R32-,N:S33-,ORF1a:S2083-,ORF1a:L3674-,ORF1a:S3675-,ORF1a:G3676-,ORF9b:E27-,ORF9b:N28-,ORF9b:A29-,S:H69-,S:V70-,S:G142-,S:V143-,S:Y144-,S:N211-        1-50,22786-22974,23612-23876,26299,26339-26694,26941,26943,26957-27177,29828-29903            Charité_E_F:C26270T,ChinaCDC_N_F:G28881A;G28882A;G28883C,USCDC_N1_P:C28311T      89343   0       29903   3000.000000     31.851852       mediocre        1160    10.000000       0.000000        good    0     24.000000        -8.000000       0.000000        good    0.000000                0.000000        good    0               0               0       0.000000        good            0       0.000000        good

sample_clade.tsv.gz

hoelzer commented 2 years ago

That's the called insertion at the nucleotide level:

awk 'BEGIN{FS="\t"};{print $16}' sample_clade.tsv
insertions
22204:GAGCCAGAA

which corresponds to

EPE

so this should be fine.
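
The translation of the inserted nucleotides can be checked with a small sketch. It builds the standard genetic code table from the usual compact `TCAG` ordering; the function name `translate` is chosen here for illustration and is not taken from the conversion script.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt_seq):
    """Translate an in-frame nucleotide string into one-letter amino acids."""
    nt_seq = nt_seq.upper()
    if len(nt_seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return "".join(CODON_TABLE[nt_seq[i:i + 3]] for i in range(0, len(nt_seq), 3))


# Split the Nextclade-style 'position:sequence' entry from this thread.
position, inserted = "22204:GAGCCAGAA".split(":")
print(position, translate(inserted))  # 22204 EPE
```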

The question is what's at pos 22204 (it should be R for the insertion to be correct).

RaverJay commented 2 years ago

Okay, so it seems Nextclade changed the position so that it is now the last nucleotide of the preceding amino acid (R) instead of the first nucleotide of the first inserted amino acid (E). So it was 22205 before and is 22204 now.

before: '22205:GAGCCAGAA' -> 'S:R214REPE'
new: '22204:GAGCCAGAA' -> 'S:R214REPE'

changed the indexing accordingly, please test
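
The indexing change described above can be illustrated with a short sketch. It assumes that the S gene starts at 1-based position 21563 of the reference (NC_045512.2); the function names and the `pos_is_last_ref_nt` flag are hypothetical and not taken from `convert_insertions_nt2aa.py`.

```python
# Assumption: 1-based start of the S gene on the Wuhan-Hu-1 reference (NC_045512.2).
S_GENE_START = 21563


def locate_in_gene(nt_pos, gene_start=S_GENE_START):
    """Return (codon_number, index_in_codon) for a 1-based reference position.

    index_in_codon is 0, 1 or 2 for the first, second or third nucleotide.
    """
    offset = nt_pos - gene_start
    if offset < 0:
        raise ValueError("position lies upstream of the gene start")
    return offset // 3 + 1, offset % 3


def codon_before_insertion(nt_pos, pos_is_last_ref_nt=True, gene_start=S_GENE_START):
    """Codon number of the amino acid that precedes an in-frame insertion.

    pos_is_last_ref_nt=True  : pos is the last reference nucleotide before
                               the insertion (the new behaviour in this thread).
    pos_is_last_ref_nt=False : pos is one nucleotide further downstream
                               (the old behaviour in this thread).
    """
    if not pos_is_last_ref_nt:
        nt_pos -= 1  # step back onto the last nucleotide of the preceding codon
    codon, index = locate_in_gene(nt_pos, gene_start)
    if index != 2:
        raise ValueError("insertion does not fall on a codon boundary")
    return codon


print(codon_before_insertion(22204))                            # new convention
print(codon_before_insertion(22205, pos_is_last_ref_nt=False))  # old convention
```

Under the stated assumption both calls print 214, matching `S:R214REPE` for either position convention.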

hoelzer commented 2 years ago

uff

well done!

image

I will merge this, thanks a lot for the quick fix @RaverJay !

I will also close the issue but open another one for SNPeff... this should be way more stable than using Nextclade, where we even auto-update the container and thus might not notice such changes that break the conversion.