My idea of the aggregation table:
BGC_ID | Sample_ID | Prediction_tool | Contig_ID | Product_class | Contig_edge | BGC_start | BGC_end | BGC_length | Protein_count | Protein_ID | PFAM_ID | MIBiG_ID | InterPro_ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Sample_1 | antiSMASH | c_001 | Arylpolyene | no | 123 | 456 | 334 | 2 | OGCKDNOF_00056;OGCKDNOF_00057 | PF00668 | BGC0001894 | |
2 | Sample_1 | GECCO | c_002 | RiPP | one-side-truncated | 123 | 456 | 334 | 1 | OGCKDNOF_00056 | PF00668;PF08242 | | IPR001031 |
3 | Sample_2 | antiSMASH | c_001 | NRPS | two-side-truncated | 123 | 456 | 334 | 1 | OGCKDNOF_00056 | PF08242 | BGC0001894 | |
4 | Sample_2 | DeepBGC | c_002 | Arylpolyene | no | 123 | 456 | 334 | 3 | OGCKDNOF_00056;OGCKDNOF_00058;OGCKDNOF_00059 | PF00668;PF08242;PF08243 | BGC0001894 | |
`Protein_count` and `Protein_ID` refer to the annotations from prodigal/prokka.
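
The derived columns in the table above could be computed rather than filled in by hand. Below is a minimal sketch, assuming `BGC_length` counts both end coordinates (which matches 123 to 456 giving 334) and that `Protein_ID` is a semicolon-separated list. The helper name `make_row` is hypothetical and not part of comBGC.

```python
def make_row(bgc_id, sample_id, tool, contig_id, product_class,
             contig_edge, bgc_start, bgc_end, protein_ids):
    """Build one aggregation-table row as a dict (hypothetical helper)."""
    return {
        "BGC_ID": bgc_id,
        "Sample_ID": sample_id,
        "Prediction_tool": tool,
        "Contig_ID": contig_id,
        "Product_class": product_class,
        "Contig_edge": contig_edge,
        "BGC_start": bgc_start,
        "BGC_end": bgc_end,
        # Assumption: coordinates are inclusive, so add 1.
        "BGC_length": bgc_end - bgc_start + 1,
        # Derived from the prodigal/prokka protein IDs inside the BGC.
        "Protein_count": len(protein_ids),
        "Protein_ID": ";".join(protein_ids),
    }


row = make_row(1, "Sample_1", "antiSMASH", "c_001", "Arylpolyene", "no",
               123, 456, ["OGCKDNOF_00056", "OGCKDNOF_00057"])
print(row["BGC_length"], row["Protein_count"], row["Protein_ID"])
```

Deriving `Protein_count` from the ID list keeps the two columns from drifting apart.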
Feedback welcome @nf-core/funcscan so that I can implement the comBGC tool without changing too much later on.
Current considerations:
The last two columns (`MIBiG_ID` + `InterPro_ID`) could be combined, because GECCO gives InterPro but not MIBiG IDs and the other tools vice versa. The column name could be `Database_annotations`, with the entry of each row prefixed with "MIBiG-" or "InterPro-"? E.g.:

... | Database_annotations |
---|---|
... | MIBiG-BGC0001894;BGC0001895 |
... | InterPro-IPR001031;IPR001032;IPR001033 |
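
One way the combined column could be filled is sketched below. It assumes the prefix is written once per database, as in the example rows, and that a row normally carries IDs from only one database. The function name `combine_annotations` is hypothetical.

```python
def combine_annotations(mibig_ids, interpro_ids):
    """Merge MIBiG and InterPro IDs into one prefixed cell (hypothetical helper).

    Assumption: the database prefix appears once, followed by the
    semicolon-separated IDs, e.g. "MIBiG-BGC0001894;BGC0001895".
    """
    parts = []
    if mibig_ids:
        parts.append("MIBiG-" + ";".join(mibig_ids))
    if interpro_ids:
        parts.append("InterPro-" + ";".join(interpro_ids))
    # An empty string is returned when a tool reports no database IDs.
    return ";".join(parts)


print(combine_annotations(["BGC0001894", "BGC0001895"], []))
print(combine_annotations([], ["IPR001031", "IPR001032", "IPR001033"]))
```

If a row ever had IDs from both databases, a different separator between the two groups would probably be easier to parse; that case is not covered by the proposal above.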
Description of feature
Should produce two files:
- summary (Sample_Name,Tool,No_Hits)
- aggregated (Sample_Name,Tool,Contig,Hit_Name,Probability,....)
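
A rough sketch of how the two files could be written from one list of parsed hits, using only the Python standard library. The `hits` records, the tab-separated layout, and the function names are assumptions for illustration. Only the columns named explicitly above are included; `Probability` and the elided columns are left out.

```python
import csv
import io
from collections import Counter

# Hypothetical parsed hits; values are illustrative only.
hits = [
    {"Sample_Name": "Sample_1", "Tool": "antiSMASH", "Contig": "c_001", "Hit_Name": "Arylpolyene"},
    {"Sample_Name": "Sample_1", "Tool": "GECCO", "Contig": "c_002", "Hit_Name": "RiPP"},
    {"Sample_Name": "Sample_2", "Tool": "antiSMASH", "Contig": "c_001", "Hit_Name": "NRPS"},
    {"Sample_Name": "Sample_2", "Tool": "DeepBGC", "Contig": "c_002", "Hit_Name": "Arylpolyene"},
]


def write_aggregated(hits, handle):
    """Write one line per hit."""
    fields = ["Sample_Name", "Tool", "Contig", "Hit_Name"]
    writer = csv.DictWriter(handle, fieldnames=fields,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(hits)


def write_summary(hits, handle):
    """Write the number of hits per (sample, tool) pair."""
    counts = Counter((h["Sample_Name"], h["Tool"]) for h in hits)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Sample_Name", "Tool", "No_Hits"])
    for (sample, tool), n_hits in sorted(counts.items()):
        writer.writerow([sample, tool, n_hits])


# In-memory buffers stand in for the two output files.
aggregated_buf = io.StringIO()
summary_buf = io.StringIO()
write_aggregated(hits, aggregated_buf)
write_summary(hits, summary_buf)
print(summary_buf.getvalue())
```

Building the summary from the same records as the aggregated file means the counts cannot disagree with the detailed table.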