suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data

warning of unknown gene when using supplied protein domains.gff3 #156

Closed alexander-e-f-smith closed 2 years ago

alexander-e-f-smith commented 2 years ago

Hi,

I seem to be getting an incompatibility and an associated warning when using the protein_domains_hg38_GRCh38_v2.0.0.gff3 file that comes with Arriba. It is a long list of warnings, all saying "unknown gene", perhaps for every gene seen. For example:

WARNING: unknown gene: FAM231B ENSG00000268991

Note: I'm using the reference FASTA and annotations GFF file packaged with STAR-Fusion for assembly and annotation through Arriba. Everything else about the Arriba runs, including the output and the detection of known fusions, looks good and as expected.

Thanks for any help in this matter. A

suhrig commented 2 years ago

Hi,

This is normal and a minor issue. It is not caused by mixing reference files from STAR-Fusion and Arriba (well, only to a minor degree). You also get the warnings with the assembly/annotation which Arriba's download_references.sh script downloads.

The warnings about unknown genes contained in the protein domains file are independent of the fusion calls. They should always be the same regardless of the input sample. (Can you confirm?) Arriba processes the protein domains file, and whenever it finds a gene that cannot be matched to the annotation (the GTF file supplied via the -a parameter), it issues a warning. Given that the protein domains file contains tens of thousands of genes, the list of warnings is actually very short. Only a negligible fraction of genes is affected, and most likely none of them are involved in fusions. Does this dispel your concerns?

Regards, Sebastian
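
For anyone who wants to see which genes are affected without running Arriba, the matching described above can be approximated with a short script. This is only a rough sketch, not Arriba's actual logic. It assumes that the protein domains GFF3 stores `gene_id=` and `gene_name=` in its attribute column, that the GTF uses the usual `gene_id "..."` attribute, and that Ensembl version suffixes (for example `.5`) can be ignored when comparing IDs. The function names are made up for this illustration.

```python
import re

# Matches the gene_id attribute in a GTF line, e.g. gene_id "ENSG00000268991.1";
GTF_GENE_ID = re.compile(r'gene_id "([^"]+)"')


def strip_version(gene_id):
    """Drop an Ensembl-style version suffix (assumption: versions are irrelevant)."""
    return gene_id.split(".", 1)[0]


def gtf_gene_ids(lines):
    """Collect all gene IDs found in the lines of a GTF annotation."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue
        match = GTF_GENE_ID.search(line)
        if match:
            ids.add(strip_version(match.group(1)))
    return ids


def gff3_genes(lines):
    """Map gene ID -> gene name for every gene in the protein domains GFF3.

    Assumes key=value pairs separated by ';' in column 9, with keys
    named gene_id and gene_name.
    """
    genes = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        if "gene_id" in attrs:
            genes[strip_version(attrs["gene_id"])] = attrs.get("gene_name", "")
    return genes


def unmatched_genes(gtf_lines, gff3_lines):
    """Return sorted (gene_name, gene_id) pairs present in the GFF3 but not in the GTF."""
    known = gtf_gene_ids(gtf_lines)
    domains = gff3_genes(gff3_lines)
    return sorted((name, gid) for gid, name in domains.items() if gid not in known)
```

To use it, open both files and pass the file objects in, for example `unmatched_genes(open("annotation.gtf"), open("protein_domains_hg38_GRCh38_v2.0.0.gff3"))`, where `annotation.gtf` stands in for whatever file is given to `-a`. Because the result depends only on the two reference files, it should be identical for every sample, which is consistent with the warnings not changing between runs. The sketch compares gene IDs only, so its output may not match Arriba's warning list exactly.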

alexander-e-f-smith commented 2 years ago

Hi Sebastian. Yes, that answers my concerns and makes complete sense, thanks very much! And yes, the warnings are the same for all samples. Best, Alex