tgen / bisbee

alternative splicing analysis pipeline
MIT License
17 stars 4 forks source link

Incorporating Non-Ensembl Data - "no such column: exon_number" #8

Open MattHuff opened 2 years ago

MattHuff commented 2 years ago

I have used Bisbee to call significant alternative splicing events, and I would like to provide sequences for these alternatively spliced proteins. While the default scripts of Bisbee do not allow for use of organisms not available on Ensembl, the script uses pyensembl, which includes commands to load outside data into pyensembl's database. I have created my own copy of the build.py script designed to load my organism and get used as a pyensembl object. However, the most recent updates give me the following error:

sys.version_info(major=3, minor=10, micro=1, releaselevel='final', serial=0)
<module 'pandas._version' from '/home/mhuff10/miniconda3/envs/mpythonX/lib/python3.10/site-packages/pandas/_version.py'>
/pickett_flora/projects/green_ash_genome/version_1/1.4/analysis/rob_getExonCounts/2_bisbee/get_sequences/utils.py
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /pickett_flora/projects/green_ash_genome/version_1/1.4/final_files/Fpennsylvanica_v1.4_genes_CDS.fa.pickle
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /pickett_flora/projects/green_ash_genome/version_1/1.4/final_files/Fpennsylvanica_v1.4_genes_AA.fa.pickle
WARNING:pyensembl.database:Encountered error "no such column: exon_number" from query "
            SELECT exon_number, start, end
            FROM exon
            WHERE transcript_id = ?
        " with parameters ['Fp_g4.t1']
Traceback (most recent call last):
  File "/home/mhuff10/miniconda3/envs/mpythonX/lib/python3.10/site-packages/pyensembl/common.py", line 61, in wrapped_fn
    return cache[cache_key]
KeyError: (<pyensembl.database.Database object at 0x7f1381806e90>, ('feature', 'exon'), ('filter_column', 'transcript_id'), ('filter_value', 'Fp_g4.t1'), ('select_column_names', ('exon_number', 'start', 'end')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pickett_flora/projects/green_ash_genome/version_1/1.4/analysis/rob_getExonCounts/2_bisbee/get_sequences/build_Fp.py", line 81, in <module>
    transcript_table=bb.find_matching_transcripts(Fp_data,row["gene"],event_coords)
  File "/pickett_flora/projects/green_ash_genome/version_1/1.4/analysis/rob_getExonCounts/2_bisbee/get_sequences/utils.py", line 191, in find_matching_transcripts
    transcript_table.loc[tid,"matching_isoform"]=get_matching_isoform(transcript,event_coords)
  File "/pickett_flora/projects/green_ash_genome/version_1/1.4/analysis/rob_getExonCounts/2_bisbee/get_sequences/utils.py", line 195, in get_matching_isoform
    exons=pd.DataFrame(transcript.exon_intervals,columns=["start","end"]).sort_values(by="start")
  File "/home/mhuff10/miniconda3/envs/mpythonX/lib/python3.10/site-packages/memoized_property.py", line 40, in fget_memoized
    setattr(self, attr_name, fget(self))
  File "/home/mhuff10/miniconda3/envs/mpythonX/lib/python3.10/site-packages/pyensembl/transcript.py", line 265, in exon_intervals
    results = self.db.query(
  File "/home/mhuff10/miniconda3/envs/mpythonX/lib/python3.10/site-packages/pyensembl/common.py", line 63, in wrapped_fn
    value = fn(*args, **kwargs)
  File "/home/mhuff10/miniconda3/envs/mpythonX/lib/python3.10/site-packages/pyensembl/database.py", line 467, in query
    return self.run_sql_query(
  File "/home/mhuff10/miniconda3/envs/mpythonX/lib/python3.10/site-packages/pyensembl/database.py", line 428, in run_sql_query
    cursor = self.connection.execute(sql, query_params)
sqlite3.OperationalError: no such column: exon_number

This is the command that was run:

python ./build_Fp.py ../all_samples/GreenAsh_Diff_Alt3Prime.bisbeeDiff.csv A3 9 GreenAsh_Diff_Alt3Prime 95 Fraxinus_pennsylvanica_genome.v1.4.scaffoldsOver10k.fasta

And here is the command that was added to my custom copy of build.py:

## lines 26 to 31:
Fp_data = pyensembl.Genome(
        reference_name='Fraxinus_pennsylvanica_genome.v1.4.scaffoldsOver10k',
        annotation_name='Fraxinus_pennsylvanica_genome_v1.4',
        gtf_path_or_url='./Fpennsylvanica_v1.4_genes.gtf',
        transcript_fasta_paths_or_urls='./Fpennsylvanica_v1.4_genes_CDS.fa',
        protein_fasta_paths_or_urls='./Fpennsylvanica_v1.4_genes_AA.fa')
Fp_data.index()

line 81

transcript_table=bb.find_matching_transcripts(Fp_data,row["gene"],event_coords)

line 98

transcript=Fp_data.transcript_by_id(tid)

If this is an issue on the issue of pyensembl, please let me know.

MattHuff commented 2 years ago

I have confirmed that this was due to the GTF file I provided not including exon_number as a feature; after generating a GTF file containing this information, I was able to get the build script working with my genome.

However, my concern now is that none of my detected splice variants appear to have any coding variants. Even with a feature such as "exon skipping," all of the identifed variants are marked as "non-coding." I provided CDS and AA information when building my database, which is consistent with the GFF3 and GTF files I have provided throughout the study. I have tried renaming the CDS and AA sequences in the provided FASTA files, but that hasn't fixed the issue. I understand this is an unprecedented use of Bisbee, but I want to see if there are comparable issues in using default Ensembl.