BGC Summary Table - Githubissues

nf-core / funcscan

(Meta-)genome screening for functional and natural product gene sequences

https://nf-co.re/funcscan

MIT License


BGC Summary Table #64

Closed jfy133 closed 1 year ago

jfy133 commented 2 years ago

Description of feature

Should produce two files:

summary (Sample_Name, Tool, No_Hits)
aggregated (Sample_Name, Tool, Contig, Hit_Name, Probability, ....)
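To illustrate how the two files relate, here is a minimal sketch that derives summary rows from aggregated rows. It assumes that No_Hits means "number of hits" per sample and tool; the example rows and the function name `summarise` are hypothetical, not part of the pipeline.

```python
from collections import Counter

# Hypothetical aggregated rows: one dict per hit, using the proposed column names.
aggregated = [
    {"Sample_Name": "Sample_1", "Tool": "antiSMASH", "Contig": "c_001"},
    {"Sample_Name": "Sample_1", "Tool": "GECCO", "Contig": "c_002"},
    {"Sample_Name": "Sample_1", "Tool": "GECCO", "Contig": "c_003"},
    {"Sample_Name": "Sample_2", "Tool": "DeepBGC", "Contig": "c_002"},
]


def summarise(rows):
    """Count hits per (Sample_Name, Tool) pair and return summary rows."""
    counts = Counter((row["Sample_Name"], row["Tool"]) for row in rows)
    return [
        {"Sample_Name": sample, "Tool": tool, "No_Hits": n}
        for (sample, tool), n in sorted(counts.items())
    ]


summary = summarise(aggregated)
for row in summary:
    print(row["Sample_Name"], row["Tool"], row["No_Hits"], sep=",")
```

Note that a sample/tool combination with zero hits would not appear in this summary, because it is built only from rows that exist in the aggregated table. If such combinations should be listed, they would have to come from the list of samples and tools that were run.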

jasmezz commented 2 years ago

My idea of the aggregation table:

BGC_ID	Sample_ID	Prediction_tool	Contig_ID	Product_class	Contig_edge	BGC_start	BGC_end	BGC_length	Protein_count	Protein_ID	PFAM_ID	MIBiG_ID	InterPro_ID
1	Sample_1	antiSMASH	c_001	Arylpolyene	no	123	456	334	2	OGCKDNOF_00056;OGCKDNOF_00057	PF00668	BGC0001894
2	Sample_1	GECCO	c_002	RiPP	one-side-truncated	123	456	334	1	OGCKDNOF_00056	PF00668;PF08242		IPR001031
3	Sample_2	antiSMASH	c_001	NRPS	two-side-truncated	123	456	334	1	OGCKDNOF_00056	PF08242	BGC0001894
4	Sample_2	DeepBGC	c_002	Arylpolyene	no	123	456	334	3	OGCKDNOF_00056;OGCKDNOF_00058;OGCKDNOF_00059	PF00668;PF08242;PF08243	BGC0001894

Protein_count and Protein_ID refer to the annotations from prodigal/prokka.
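A rough sketch of how such a table could be read and sanity-checked, using two rows copied from the example above. It assumes that Protein_count equals the number of ";"-separated entries in Protein_ID and that BGC_length is BGC_end - BGC_start + 1 (inclusive coordinates), which matches the example values; `check_row` is a hypothetical helper.

```python
import csv
import io

# Two rows copied from the proposed table (tab-separated).
# The first row has no trailing InterPro_ID field, as in the example.
TABLE = (
    "BGC_ID\tSample_ID\tPrediction_tool\tContig_ID\tProduct_class\tContig_edge\t"
    "BGC_start\tBGC_end\tBGC_length\tProtein_count\tProtein_ID\tPFAM_ID\t"
    "MIBiG_ID\tInterPro_ID\n"
    "1\tSample_1\tantiSMASH\tc_001\tArylpolyene\tno\t123\t456\t334\t2\t"
    "OGCKDNOF_00056;OGCKDNOF_00057\tPF00668\tBGC0001894\n"
    "2\tSample_1\tGECCO\tc_002\tRiPP\tone-side-truncated\t123\t456\t334\t1\t"
    "OGCKDNOF_00056\tPF00668;PF08242\t\tIPR001031\n"
)


def check_row(row):
    """Return a list of consistency problems found in one table row."""
    problems = []
    proteins = [p for p in row["Protein_ID"].split(";") if p]
    if len(proteins) != int(row["Protein_count"]):
        problems.append("Protein_count does not match Protein_ID")
    # Assumption: start and end are both inclusive.
    if int(row["BGC_end"]) - int(row["BGC_start"]) + 1 != int(row["BGC_length"]):
        problems.append("BGC_length does not match BGC_start/BGC_end")
    return problems


# restval="" fills columns that are missing at the end of a short row.
rows = list(csv.DictReader(io.StringIO(TABLE), delimiter="\t", restval=""))
problems = {row["BGC_ID"]: check_row(row) for row in rows}
print(problems)
```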

Feedback welcome @nf-core/funcscan so that I can implement the comBGC tool without changing too much later on.

Current considerations:

We could give probability values for GECCO + DeepBGC. Only antiSMASH predicts differently and has no comparable probability/confidence values.
The last two columns (MIBiG_ID + InterPro_ID) could be combined, because GECCO gives InterPro IDs but not MIBiG IDs, and the other tools vice versa. The column could be named Database_annotations, with the entry of each row prefixed with "MIBiG-" or "InterPro-"? E.g.

...	Database_annotations
...	MIBiG-BGC0001894;BGC0001895
...	InterPro-IPR001031;IPR001032;IPR001033
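The combined column could be built along these lines. This is only a sketch: `merge_annotations` is a hypothetical helper, and the handling of a row that carries both kinds of IDs (both groups joined with ";") is an assumption, since the proposal only covers rows with one or the other.

```python
def merge_annotations(mibig_ids, interpro_ids):
    """Combine two ';'-separated ID strings into one Database_annotations value.

    The prefix is written once per group, as in the example rows above.
    """
    parts = []
    if mibig_ids:
        parts.append("MIBiG-" + mibig_ids)
    if interpro_ids:
        # Assumption: if both groups are present, they are joined with ';'.
        parts.append("InterPro-" + interpro_ids)
    return ";".join(parts)


print(merge_annotations("BGC0001894;BGC0001895", ""))
print(merge_annotations("", "IPR001031;IPR001032;IPR001033"))
```

With the prefix written once per group, a reader of the table has to treat every ID up to the next prefix as belonging to the same database. Prefixing every single ID instead would make each entry self-describing, at the cost of longer cells.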