uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0
5 stars 1 forks source link

`parseFusionCatcher` which index to use? #233

Closed lydiayliu closed 2 years ago

lydiayliu commented 2 years ago

it's been a while since parseXXFusion parsers were developed and I believe lots has changed with the index and TVF/GVF lol

finally starting to test on real data XD

command

docker run -u $(id -u):$(id -g) -it --rm -v /hot/ref/reference/GRCh38-EBI-GENCODE34/:/reference/ -v /hot/projects/cpcgene/rna/rna-seq/pipeline_runs/call-FusionTranscript/output/:/FUSION/ -v /hot/users/yiyangliu/MoPepGen/:/data/ blcdsdockerregistry/mopepgen:dev /bin/bash

for a in /FUSION/*; do echo ${a};
b=$(basename -- "$a"); echo ${b};
moPepGen parseFusionCatcher \
    --index-dir /data/Index/GRCh38-EBI-GENCODE34/ \
    --output-prefix /data/Parser/Fusion/fusioncatcher-1.33/${b} \
    -f ${a}/fusioncatcher-1.33/final-list_candidate-fusion-genes.txt \
    --source fusioncatcher
done

error

/FUSION/CPCG0100                                                                                                                                                           
CPCG0100                                                                                                                                                                   
[ 2021-11-22 20:35:57 ] moPepGen parseFusionCatcher started                                                                                                                
Traceback (most recent call last):                                                                                                                                         
  File "/usr/local/bin/moPepGen", line 8, in <module>                                                                                                                      
    sys.exit(main())                                                                                                                                                       
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 75, in main                                                                                 
    args.func(args)                                                                                                                                                        
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/parse_fusion_catcher.py", line 72, in parse_fusion_catcher                                                     
    var_records = record.convert_to_variant_records(anno, genome)                                                                                                          
  File "/usr/local/lib/python3.8/site-packages/moPepGen/parser/FusionCatcherParser.py", line 94, in convert_to_variant_records                                             
    donor_gene_model = anno.get_gene_model_from_unversioned_id(                                                                                                            
  File "/usr/local/lib/python3.8/site-packages/moPepGen/gtf/GenomicAnnotation.py", line 365, in get_gene_model_from_unversioned_id                                         
    self.create_gene_id_version_mapper()                                                                                                                                   
  File "/usr/local/lib/python3.8/site-packages/moPepGen/gtf/GenomicAnnotation.py", line 345, in create_gene_id_version_mapper                                              
    raise ValueError('Unversioned gene ID collapsed.')                                                                                                                     
ValueError: Unversioned gene ID collapsed.

example input /hot/projects/cpcgene/rna/rna-seq/pipeline_runs/call-FusionTranscript/output/CPCG0100/fusioncatcher-1.33/final-list_candidate-fusion-genes.txt

For some reason the folder contains a /hot/projects/cpcgene/rna/rna-seq/pipeline_runs/call-FusionTranscript/output/CPCG0100/fusioncatcher-1.33/final-list_candidate-fusion-genes.hg19.txt as well... so im assuming the one above is hg38

zhuchcn commented 2 years ago

This ENSEMBL gene ID 'ENSG00000228572' has two GENCODE gene IDs, 'ENSG00000228572.7' and 'ENSG00000228572.7_PAR_Y'. The former is on chrX and latter on chrY. However, in ENSEMBL's GTF, the chrY version of the gene does not exist. I need to understand how GENCODE/ENSEMBL name their genes a little more.

Btw, I think we ended up using ENSEMBL-102 to run FusionCatcher. According to this thread https://github.com/ndaniel/fusioncatcher/issues/190 , it doesn't seem possible to run ENSEMBL100 on FusionCatcher.

lydiayliu commented 2 years ago

interesting... this probably only happens when a gene is on the pseudo-autosomal region???

Btw, I think we ended up using ENSEMBL-102 to run FusionCatcher. According to this thread ndaniel/fusioncatcher#190 , it doesn't seem possible to run ENSEMBL100 on FusionCatcher.

oh i didn't realize that, I was running it with a gencode index, that explains

lydiayliu commented 2 years ago

alright I'm running parseFusionCatcher with GRCh38-EBI-ENSEMBL102 here: /hot/users/yiyangliu/MoPepGen/Index/GRCh38-EBI-ENSEMBL102/

Getting a different error for every sample apparently XD

 I have no name!@5104ba0e8b37:/$ for a in /FUSION/*; do echo ${a};                                                                                                           
> b=$(basename -- "$a"); echo ${b};                                                                                                                                         
> moPepGen parseFusionCatcher \                                                                                                                                             
>     --index-dir /data/Index/GRCh38-EBI-ENSEMBL102/ \                                                                                                                      
>     --output-prefix /data/Parser/Fusion/fusioncatcher-1.33/${b} \                                                                                                         
>     -f ${a}/fusioncatcher-1.33/final-list_candidate-fusion-genes.txt \                                                                                                    
>     --source fusioncatcher                                                                                                                                                
> done                                                                                                                                                                      
/FUSION/CPCG0100                                                                                                                                                            
CPCG0100                                                                                                                                                                    
[ 2021-11-25 17:14:47 ] moPepGen parseFusionCatcher started                                                                                                                 
Traceback (most recent call last):                                                                                                                                          
  File "/usr/local/bin/moPepGen", line 8, in <module>                                                                                                                       
    sys.exit(main())                                                                                                                                                        
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 75, in main                                                                                  
    args.func(args)                                                                                                                                                         
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/parse_fusion_catcher.py", line 72, in parse_fusion_catcher                                                      
    var_records = record.convert_to_variant_records(anno, genome)                                                                                                           
  File "/usr/local/lib/python3.8/site-packages/moPepGen/parser/FusionCatcherParser.py", line 105, in convert_to_variant_records                                             
    raise ValueError(                                                                                                                                                       
ValueError: Annotation GTF version mismatch with FusionCatcher.  

/FUSION/CPCG0183                                                                                                                                                            
CPCG0183                                                                                                                                                                    
[ 2021-11-25 17:16:13 ] moPepGen parseFusionCatcher started                                                                                                                 
Traceback (most recent call last):                                                                                                                                          
  File "/usr/local/bin/moPepGen", line 8, in <module>                                                                                                                       
    sys.exit(main())                                                                                                                                                        
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 75, in main                                                                                  
    args.func(args)                                                                                                                                                         
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/parse_fusion_catcher.py", line 72, in parse_fusion_catcher                                                      
    var_records = record.convert_to_variant_records(anno, genome)                                                                                                           
  File "/usr/local/lib/python3.8/site-packages/moPepGen/parser/FusionCatcherParser.py", line 96, in convert_to_variant_records                                              
    accepter_gene_model = anno.get_gene_model_from_unversioned_id(                                                                                                          
  File "/usr/local/lib/python3.8/site-packages/moPepGen/gtf/GenomicAnnotation.py", line 362, in get_gene_model_from_unversioned_id                                          
    return self.genes[gene_id]                                                                                                                                              
KeyError: 'ENSG09000000013' 

/FUSION/CPCG0184                                                                                                                                                            
CPCG0184                                                                                                                                                                    
[ 2021-11-25 17:17:38 ] moPepGen parseFusionCatcher started                                                                                                                 
Traceback (most recent call last):                                                                                                                                          
  File "/usr/local/bin/moPepGen", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 75, in main
    args.func(args)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/parse_fusion_catcher.py", line 72, in parse_fusion_catcher
    var_records = record.convert_to_variant_records(anno, genome)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/parser/FusionCatcherParser.py", line 169, in convert_to_variant_records
    ref=seq[left_breakpoint_genetic],
  File "/usr/local/lib/python3.8/site-packages/moPepGen/dna/DNASeqRecord.py", line 319, in __getitem__
    return super().__getitem__(index)
  File "/usr/local/lib/python3.8/site-packages/Bio/SeqRecord.py", line 451, in __getitem__
    return self.seq[index]
  File "/usr/local/lib/python3.8/site-packages/Bio/Seq.py", line 430, in __getitem__
    return chr(self._data[index])
IndexError: index out of range

/FUSION/CPCG0196
CPCG0196
[ 2021-11-25 17:19:03 ] moPepGen parseFusionCatcher started
Traceback (most recent call last):
  File "/usr/local/bin/moPepGen", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 75, in main
    args.func(args)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/parse_fusion_catcher.py", line 72, in parse_fusion_catcher
    var_records = record.convert_to_variant_records(anno, genome)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/parser/FusionCatcherParser.py", line 96, in convert_to_variant_records
    accepter_gene_model = anno.get_gene_model_from_unversioned_id(
  File "/usr/local/lib/python3.8/site-packages/moPepGen/gtf/GenomicAnnotation.py", line 362, in get_gene_model_from_unversioned_id
    return self.genes[gene_id]
KeyError: 'ENSG09000001017'

Which is the correct index to run with this?? I'm so confused XD. If we run FusionCatcher with ensembl, how would it work with adding all the other variants (which are gencode but could be ensembl) and with splitFASTA?

zhuchcn commented 2 years ago

interesting... this probably only happens when a gene is on the pseudo-autosomal region???

You are right! All of them are PAR.

[x for x in gencode34_id_mapper.values() if len(x) > 1]
[['ENST00000431238.7', 'ENST00000431238.7_PAR_Y'],
 ['ENST00000399012.6', 'ENST00000399012.6_PAR_Y'],
 ['ENST00000484611.7', 'ENST00000484611.7_PAR_Y'],
 ['ENST00000430923.7', 'ENST00000430923.7_PAR_Y'],
 ['ENST00000445062.6', 'ENST00000445062.6_PAR_Y'],
 ['ENST00000429181.6', 'ENST00000429181.6_PAR_Y'],
...

So I can just get ride of anything with a suffix of 'PAR_Y'

lydiayliu commented 2 years ago

So I can just get ride of anything with a suffix of 'PAR_Y'

yeah sounds like a plan! might be worth 2 seconds checking that the transcript sequences are the exact same, i certainly hope they are XD

zhuchcn commented 2 years ago

Here we go. So we take all unversioned tx ID of GENCODE that also has a PAR_Y version, and get the transcript from ENSEMBEL GTF, and they are all chrX.

Counter(anno_ensembl100.transcripts[x].transcript.chrom for x in tx_id_with_par)
Counter({'X': 161})

Gene ID is the same thins:

Counter(anno_ensembl100.genes[x].chrom for x in gene_id_with_par)
Counter(anno_ensembl100.genes[x].chrom for x in gene_id_with_par)

So I guess ENSEMBL just does not include those PAR_Y at all. Then the genes/transcripts from FusionCatcher must only come from the chrX versions.

lydiayliu commented 2 years ago

by definition, the "versions" should be the exact same right XD but yeah, I think just using the transcript ids without the PAR_Y should work fine

zhuchcn commented 2 years ago

/FUSION/CPCG0100

C6orf47 vs C6ORF47 Seems like FusionCatcher turns all gene symbols to upper case

zhuchcn commented 2 years ago

Seems like there are just some gene IDs called by FusionCatcher that are not in either ENSEMBl or GENCODE's GTF. FusionCatcher seems to also use some other databases besides the GTF file, so maybe we can just ignore those if they can't be found from GTF. You can use gencode as long as it's the same version. For example, GNECODE 36 is equivalent to ENSEMBL102 because they release together. I think the unversioned gene IDs are stable, right? So if even if we use the gencode 34 index, it should be able to assign the variants to the correct coordinates?

lydiayliu commented 2 years ago

Seems like there are just some gene IDs called by FusionCatcher that are not in either ENSEMBl or GENCODE's GTF.

Interesting... what do they expect people to do with that. lets just ignore those XD

GENCODE 36 is equivalent to ENSEMBL102 because they release together

interesting thought, would it matter for splitFASTA cuz the rest of the databases are all from gencodev34?

I dont feel like gencode 36 would change that much from gencode 34 but who knows...