Closed lydiayliu closed 2 years ago
This ENSEMBL gene ID 'ENSG00000228572' has two GENCODE gene IDs, 'ENSG00000228572.7' and 'ENSG00000228572.7_PAR_Y'. The former is on chrX and latter on chrY. However, in ENSEMBL's GTF, the chrY version of the gene does not exist. I need to understand how GENCODE/ENSEMBL name their genes a little more.
Btw, I think we ended up using ENSEMBL-102 to run FusionCatcher. According to this thread https://github.com/ndaniel/fusioncatcher/issues/190 , it doesn't seem possible to run ENSEMBL100 on FusionCatcher.
interesting... this probably only happens when a gene is on the pseudo-autosomal region???
Btw, I think we ended up using ENSEMBL-102 to run FusionCatcher. According to this thread ndaniel/fusioncatcher#190 , it doesn't seem possible to run ENSEMBL100 on FusionCatcher.
oh i didn't realize that, I was running it with a gencode index, that explains
alright I'm running parseFusionCatcher
with GRCh38-EBI-ENSEMBL102
here:
/hot/users/yiyangliu/MoPepGen/Index/GRCh38-EBI-ENSEMBL102/
Getting a different error for every sample apparently XD
I have no name!@5104ba0e8b37:/$ for a in /FUSION/*; do echo ${a};
> b=$(basename -- "$a"); echo ${b};
> moPepGen parseFusionCatcher \
> --index-dir /data/Index/GRCh38-EBI-ENSEMBL102/ \
> --output-prefix /data/Parser/Fusion/fusioncatcher-1.33/${b} \
> -f ${a}/fusioncatcher-1.33/final-list_candidate-fusion-genes.txt \
> --source fusioncatcher
> done
/FUSION/CPCG0100
CPCG0100
[ 2021-11-25 17:14:47 ] moPepGen parseFusionCatcher started
Traceback (most recent call last):
File "/usr/local/bin/moPepGen", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 75, in main
args.func(args)
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/parse_fusion_catcher.py", line 72, in parse_fusion_catcher
var_records = record.convert_to_variant_records(anno, genome)
File "/usr/local/lib/python3.8/site-packages/moPepGen/parser/FusionCatcherParser.py", line 105, in convert_to_variant_records
raise ValueError(
ValueError: Annotation GTF version mismatch with FusionCatcher.
/FUSION/CPCG0183
CPCG0183
[ 2021-11-25 17:16:13 ] moPepGen parseFusionCatcher started
Traceback (most recent call last):
File "/usr/local/bin/moPepGen", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 75, in main
args.func(args)
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/parse_fusion_catcher.py", line 72, in parse_fusion_catcher
var_records = record.convert_to_variant_records(anno, genome)
File "/usr/local/lib/python3.8/site-packages/moPepGen/parser/FusionCatcherParser.py", line 96, in convert_to_variant_records
accepter_gene_model = anno.get_gene_model_from_unversioned_id(
File "/usr/local/lib/python3.8/site-packages/moPepGen/gtf/GenomicAnnotation.py", line 362, in get_gene_model_from_unversioned_id
return self.genes[gene_id]
KeyError: 'ENSG09000000013'
/FUSION/CPCG0184
CPCG0184
[ 2021-11-25 17:17:38 ] moPepGen parseFusionCatcher started
Traceback (most recent call last):
File "/usr/local/bin/moPepGen", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 75, in main
args.func(args)
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/parse_fusion_catcher.py", line 72, in parse_fusion_catcher
var_records = record.convert_to_variant_records(anno, genome)
File "/usr/local/lib/python3.8/site-packages/moPepGen/parser/FusionCatcherParser.py", line 169, in convert_to_variant_records
ref=seq[left_breakpoint_genetic],
File "/usr/local/lib/python3.8/site-packages/moPepGen/dna/DNASeqRecord.py", line 319, in __getitem__
return super().__getitem__(index)
File "/usr/local/lib/python3.8/site-packages/Bio/SeqRecord.py", line 451, in __getitem__
return self.seq[index]
File "/usr/local/lib/python3.8/site-packages/Bio/Seq.py", line 430, in __getitem__
return chr(self._data[index])
IndexError: index out of range
/FUSION/CPCG0196
CPCG0196
[ 2021-11-25 17:19:03 ] moPepGen parseFusionCatcher started
Traceback (most recent call last):
File "/usr/local/bin/moPepGen", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 75, in main
args.func(args)
File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/parse_fusion_catcher.py", line 72, in parse_fusion_catcher
var_records = record.convert_to_variant_records(anno, genome)
File "/usr/local/lib/python3.8/site-packages/moPepGen/parser/FusionCatcherParser.py", line 96, in convert_to_variant_records
accepter_gene_model = anno.get_gene_model_from_unversioned_id(
File "/usr/local/lib/python3.8/site-packages/moPepGen/gtf/GenomicAnnotation.py", line 362, in get_gene_model_from_unversioned_id
return self.genes[gene_id]
KeyError: 'ENSG09000001017'
Which is the correct index to run with this?? I'm so confused XD. If we run FusionCatcher with ensembl, how would it work with adding all the other variants (which are gencode but could be ensembl) and with splitFASTA
?
interesting... this probably only happens when a gene is on the pseudo-autosomal region???
You are right! All of them are PAR.
[x for x in gencode34_id_mapper.values() if len(x) > 1]
[['ENST00000431238.7', 'ENST00000431238.7_PAR_Y'],
['ENST00000399012.6', 'ENST00000399012.6_PAR_Y'],
['ENST00000484611.7', 'ENST00000484611.7_PAR_Y'],
['ENST00000430923.7', 'ENST00000430923.7_PAR_Y'],
['ENST00000445062.6', 'ENST00000445062.6_PAR_Y'],
['ENST00000429181.6', 'ENST00000429181.6_PAR_Y'],
...
So I can just get ride of anything with a suffix of 'PAR_Y'
So I can just get ride of anything with a suffix of 'PAR_Y'
yeah sounds like a plan! might be worth 2 seconds checking that the transcript sequences are the exact same, i certainly hope they are XD
Here we go. So we take all unversioned tx ID of GENCODE that also has a PAR_Y version, and get the transcript from ENSEMBEL GTF, and they are all chrX.
Counter(anno_ensembl100.transcripts[x].transcript.chrom for x in tx_id_with_par)
Counter({'X': 161})
Gene ID is the same thins:
Counter(anno_ensembl100.genes[x].chrom for x in gene_id_with_par)
Counter(anno_ensembl100.genes[x].chrom for x in gene_id_with_par)
So I guess ENSEMBL just does not include those PAR_Y at all. Then the genes/transcripts from FusionCatcher must only come from the chrX versions.
by definition, the "versions" should be the exact same right XD but yeah, I think just using the transcript ids without the PAR_Y should work fine
/FUSION/CPCG0100
C6orf47 vs C6ORF47 Seems like FusionCatcher turns all gene symbols to upper case
Seems like there are just some gene IDs called by FusionCatcher that are not in either ENSEMBl or GENCODE's GTF. FusionCatcher seems to also use some other databases besides the GTF file, so maybe we can just ignore those if they can't be found from GTF. You can use gencode as long as it's the same version. For example, GNECODE 36 is equivalent to ENSEMBL102 because they release together. I think the unversioned gene IDs are stable, right? So if even if we use the gencode 34 index, it should be able to assign the variants to the correct coordinates?
Seems like there are just some gene IDs called by FusionCatcher that are not in either ENSEMBl or GENCODE's GTF.
Interesting... what do they expect people to do with that. lets just ignore those XD
GENCODE 36 is equivalent to ENSEMBL102 because they release together
interesting thought, would it matter for splitFASTA cuz the rest of the databases are all from gencodev34?
I dont feel like gencode 36 would change that much from gencode 34 but who knows...
it's been a while since parseXXFusion parsers were developed and I believe lots has changed with the index and TVF/GVF lol
finally starting to test on real data XD
command
error
example input
/hot/projects/cpcgene/rna/rna-seq/pipeline_runs/call-FusionTranscript/output/CPCG0100/fusioncatcher-1.33/final-list_candidate-fusion-genes.txt
For some reason the folder contains a
/hot/projects/cpcgene/rna/rna-seq/pipeline_runs/call-FusionTranscript/output/CPCG0100/fusioncatcher-1.33/final-list_candidate-fusion-genes.hg19.txt
as well... so im assuming the one above is hg38