Error in gtf library when trying to add stringtie output

griffithlab / VAtools

A set of tools to annotate VCF files with expression and readcount data

http://www.vatools.org

MIT License

25 stars 12 forks source link

Error in gtf library when trying to add stringtie output #77

Open lukaas33 opened 2 months ago

lukaas33 commented 2 months ago

I am getting the following error from the gtf library when trying to annotate my vcf with stringtie output.

$ vcf-expression-annotator -o /shared_dir/temp.annotated.vcf -s sample /shared_dir/temp.vep.vcf /shared_dir/temp.abundance.tsv stringtie transcript
Traceback (most recent call last):
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/gtfparse/read_gtf.py", line 121, in parse_with_polars_lazy
    **kwargs).lazy()
AttributeError: 'LazyFrame' object has no attribute 'lazy'

Version 2.0.1 with Python 3.6

I have also reported this issue at the gtf page: https://github.com/openvax/gtfparse/issues/49

lukaas33 commented 2 months ago

I have now ran the same command in the latest vatools Docker image and am getting a new error:

> vcf-expression-annotator -o /shared_dir/temp.annotated.vcf -s sample /shared_dir/temp.vep.vcf /shared_dir/temp.abundance.tsv stringtie transcript
Traceback (most recent call last):
  File "parsers.pyx", line 1160, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

lukaas33 commented 2 months ago

The start of my stringtie output:

Gene ID Gene Name   Reference   Strand  Start   End Coverage    FPKM    TPM
ENSG00000282881 TMEM275 1   -   46532166    46543969    0.0 0.0 0.0
ENSG00000201405 Y_RNA   1   +   23370254    23370346    0.0 0.0 0.0
ENSG00000143774 GUK1    1   +   228139962   228148984   0.0 0.0 0.0
ENSG00000288775 .   1   -   159776325   159779383   0.0 0.0 0.0
ENSG00000239887 C1orf226    1   +   162378841   162386812   0.0 0.0 0.0
ENSG00000200575 RNU6-414P   1   +   61816419    61816522    0.0 0.0 0.0
ENSG00000251785 RNA5SP20    1   +   77614869    77614952    0.0 0.0 0.0
ENSG00000237872 POU5F1P4    1   +   155433178   155434262   0.0 0.0 0.0

susannasiebert commented 2 months ago

For the first error, I suspect that one of the dependencies you have installed is incompatible with gtfparse. Here are the versions in our Docker images:

gtfparse==1.3.0
numpy==1.26.1
pandas==2.1.1
pysam==0.22.0
python-dateutil==2.8.2
pytz==2023.3.post1
six==1.16.0
testfixtures==7.2.0
tzdata==2023.3
vatools==5.1.0
vcfpy==0.12.3

Try downgrading your dependencies to match these versions.

susannasiebert commented 2 months ago

I'm unable to reproduce the error in your second comment with the stringtie output provided. Can you please attach all of your input files (VCF and full stringtie TSV)?

lukaas33 commented 2 months ago

Hi @susannasiebert, can I send these to you privately to maintain privacy of this data?

susannasiebert commented 2 months ago

Yes, absolutely. My email is susanna.kiwala@wustl.edu

susannasiebert commented 2 months ago

My apologies for the belated replies. After investigating your files, it looks like you are trying to use a gene abundance file in transcript mode. If you switch your command to

vcf-expression-annotator -o /shared_dir/temp.annotated.vcf -s sample /shared_dir/temp.vep.vcf /shared_dir/temp.abundance.tsv stringtie gene

It works without problems.

The transcript abundance file from stringtie is in gtf format while the gene abundance file is in tsv format. Mixing them up leads to unexpected errors like the one you are seeing. I've added issue #78 to add better error handling for this case.

lukaas33 commented 2 months ago

Ah, and sorry for my ignorance. But wouldn't I need to add both of them? So the tsv and gtf files.

Is there a way to do this with one command or will this always involve two steps?

Or does the transcript level expression always contain more detail?

susannasiebert commented 2 months ago

You would need to run this as two steps, unfortunately.

lukaas33 commented 2 months ago

Ah so it is recommended to also add gene expression level besides transcript expression level?

susannasiebert commented 2 months ago

Yes, the gene expression levels are, for example, used during tiering in the aggregated report.

lukaas33 commented 2 months ago

Ah in that case it may be nice to have a feature to add both files. Or to support command line piping so that it can be done in one line and creates only one file.