shandley / hecatomb

hecatomb is a virome analysis pipeline for Illumina sequence data
MIT License

missing fields in outfile #68

Closed. mihinduk closed this issue 2 years ago

mihinduk commented 2 years ago

Hi Mike,

The tophit.m8 should have these columns: query | target | evalue | pident | fident | nident | mismatch | qcov | tcov | qstart | qend | qlen | tstart | tend | tlen | alnlen | bits | qheader | theader | taxid | taxname | lineage

But the last three columns (taxid, taxname, lineage) are empty. They will be important for parsing the contig taxonomy by kingdom, family, etc.

From: rule PRIMARY_AA_taxonomy_assignment

Kathie
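
For anyone hitting the same symptom, here is a minimal sketch for checking which columns of a tophit.m8 file are empty in every row. It assumes the file is tab-separated with no header line and that the fields appear in the order listed above; the function name `empty_columns` is hypothetical and not part of hecatomb.

```python
import csv

# Column names as listed in this issue for tophit.m8.
# Assumption: tab-separated, no header row, fields in this order.
M8_COLUMNS = [
    "query", "target", "evalue", "pident", "fident", "nident", "mismatch",
    "qcov", "tcov", "qstart", "qend", "qlen", "tstart", "tend", "tlen",
    "alnlen", "bits", "qheader", "theader", "taxid", "taxname", "lineage",
]


def empty_columns(lines, columns=M8_COLUMNS):
    """Return the names of columns that are blank or missing in every row."""
    seen_value = {name: False for name in columns}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # zip stops at the shorter sequence, so truncated rows simply
        # contribute nothing to the trailing columns.
        for name, value in zip(columns, row):
            if value.strip():
                seen_value[name] = True
    return [name for name in columns if not seen_value[name]]
```

To run it on a real file, pass an open file handle, for example `empty_columns(open("tophit.m8"))`. For the file described here it would be expected to report taxid, taxname and lineage.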

beardymcjohnface commented 2 years ago

You should probably be using the tophits from the secondary output, unless you're talking about the contig annotations as opposed to the seqtable annotations?

mihinduk commented 2 years ago

I am talking about contig annotations.

Also, I found that the contigSeqTable.tsv has 14 columns for samples that DON'T have taxonomy, but 19 for those that do:

contigID    seqID   start   stop    len qual    count   CPM alnType taxMethod   kingdom phylum  class   order   family  genus   species baltimoreType   baltimoreGroup
contig_1000 169-06-08-13-12_CAGATC:1:140311 11  252 241 17  NA  NA  NA  NA  NA  NA  NA  NA                  
contig_1000 120-06-02-24-12_ATCACG:3:171191 214 406 192 0   3   2.989033237 nt  LCA Viruses Cressdnaviricota    Arfiviricetes   Cirlivirales    Circoviridae    Circovirus  Circovirus sp.  ssDNA   II
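
Until a fixed release is available, a ragged table like this can be detected and padded before loading it elsewhere. The following is only a sketch: it assumes the file is tab-separated, and the helper names `field_count_report` and `pad_rows` are hypothetical, not part of hecatomb.

```python
from collections import Counter


def field_count_report(lines, delimiter="\t"):
    """Map each observed field count to the number of rows that have it."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if line:
            counts[len(line.split(delimiter))] += 1
    return dict(counts)


def pad_rows(lines, width, fill="NA", delimiter="\t"):
    """Yield each non-empty row as a list padded with `fill` up to `width` fields."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split(delimiter)
        # A negative difference yields an empty list, so long rows are unchanged.
        fields += [fill] * (width - len(fields))
        yield fields
```

A report with more than one key (for example both 14 and 19) indicates the mismatch described above; `pad_rows(handle, 19)` then gives every row the full 19 fields.
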

beardymcjohnface commented 2 years ago

Hi Kathie, that issue with the contigSeqTable is fixed in the dev branch and will be in the next release.

The rule PRIMARY_AA_taxonomy_assignment is part of the read-based annotations, and its only real purpose is to find sequences that look like a virus so that they can be analysed in the secondary search. You should take the annotations from the secondary searches. If you look at the secondary AA mmseqs directory, the file MMSEQS_AA_SECONDARY_tophit_aln_sorted should have all the columns.

The direct contig annotations are a bit simplistic at the moment, but those files should be in ASSEMBLY/CONTIG_DICTIONARY/FLYE/results. They are simplistic because the annotation currently uses only the primary nt database, not the secondary nt database.

mihinduk commented 2 years ago

Thanks, Mike. This makes sense.

beardymcjohnface commented 2 years ago

This should be fixed in the new release.