suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data

What is the source for included protein_domain_hg38? #125

Closed ahdee closed 3 years ago

ahdee commented 3 years ago

Thanks, awesome fusion caller! What is the source of the database behind protein_domains_hg38_GRCh38_v2.1.0.gff3? Thanks!

suhrig commented 3 years ago

It contains Pfam protein domains.

ahdee commented 3 years ago

@suhrig thanks, but do you also know which site it was downloaded from? I went to the Pfam site, but it looks like there are no GFF3 files. The reason I ask is that I get asked a lot where the domain annotations come from, and it would be nice to have more details.

suhrig commented 3 years ago

The GFF3 file is not available for download anywhere other than in the Arriba release package. On the Pfam site you will find this file. To get the GFF3 file, you need to map the protein coordinates to genomic coordinates. I used the Bioconductor package ensembldb for this purpose. This is how the file was generated.
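To illustrate what mapping protein coordinates to genomic coordinates involves, here is a minimal Python sketch. It is not the script used to build the Arriba file, and it does not use ensembldb. The function names, the example coordinates, and the GFF3 attribute layout are all hypothetical. It assumes you already have the coding exon segments of a transcript (1-based, inclusive, without UTRs) and a domain given as an amino acid range.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive genomic interval


def protein_to_genome(aa_start: int, aa_end: int,
                      cds_exons: List[Interval], strand: str) -> List[Interval]:
    """Map a 1-based inclusive amino acid range onto genomic intervals.

    cds_exons holds only the coding part of each exon (hypothetical input);
    a domain spanning a splice junction yields more than one interval.
    """
    # Walk the exons in transcription order: ascending on "+", descending on "-".
    exons = sorted(cds_exons, reverse=(strand == "-"))

    # Convert amino acid positions to 0-based, half-open CDS offsets.
    cds_from = (aa_start - 1) * 3
    cds_to = aa_end * 3

    segments = []
    offset = 0  # CDS offset at which the current exon begins
    for ex_start, ex_end in exons:
        length = ex_end - ex_start + 1
        lo = max(cds_from, offset)
        hi = min(cds_to, offset + length)
        if lo < hi:  # the domain overlaps this exon
            if strand == "+":
                g_start = ex_start + (lo - offset)
                g_end = ex_start + (hi - offset) - 1
            else:
                # On the minus strand the CDS runs from high to low coordinates.
                g_end = ex_end - (lo - offset)
                g_start = ex_end - (hi - offset) + 1
            segments.append((g_start, g_end))
        offset += length
    return sorted(segments)


def gff3_lines(chrom: str, segments: List[Interval],
               strand: str, name: str) -> List[str]:
    """Format intervals as GFF3 rows (illustrative attributes, not Arriba's exact layout)."""
    return [
        f"{chrom}\tpfam\tprotein_domain\t{s}\t{e}\t.\t{strand}\t.\tName={name}"
        for s, e in segments
    ]


# Hypothetical transcript with two coding exons (21 bases, i.e. 7 codons).
example_exons = [(101, 109), (201, 212)]
plus_segments = protein_to_genome(2, 5, example_exons, "+")
minus_segments = protein_to_genome(4, 5, example_exons, "-")
rows = gff3_lines("chr1", plus_segments, "+", "Example_domain")
```

In this example the domain crosses the exon boundary, so each call returns two intervals and `rows` contains one GFF3 line per interval. A real pipeline also has to choose which transcript of each gene to use and handle incomplete coding sequences. A dedicated library such as ensembldb takes care of that.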

ahdee commented 3 years ago

@suhrig great thank you, this is very helpful.