Open hanclair opened 2 years ago
Hi,
Thank you for creating this issue.
The problem is because of the GTF file format, which is explained here. Note that fields are tab-separated, and the last column (field 9) is semicolon-separated list of tag-value pairs. You missed the semicolons. Tejaas looks for three tags in this field, namely "gene_name", "gene_id" and "gene_type". The first column should be the chromosome number, with or without the 'chr' prefix. However, with the input file that you are using, the debug messages should have included a line which says,
No gene annotations found in GTF file. Check feature 'gene' is present, with 'gene_id' and 'gene_name' annotations
I am surprised that this message was not produced. Would it be possible for you to share the GTF file that you are using so that I can check why it is not producing this particular error message?
Explanation of the error messages
IndexError: arrays used as indices must be of integer (or boolean) type
This happens because an empty array is used. No genes were selected because the gene names provided in the expression file did not match the gene names in the GTF file (obtained from field 9 of GTF with tag "gene_id"). In your GTF file, no "gene_id" tag-value pair is provided.
Troubleshooting 1: You found the same error because the gene_id from gencode.v19 GTF are different from the gene names that you are using.
Troubleshooting 2: Tejaas allows missing genotype.
Troubleshooting 3: I think this is because the GTF file was not tab-separated.
Fixed GTF File format
chr1 AUGUSTUS gene 1 29982 0.29 - . gene_id "g5649"; gene_type "protein_coding"; gene_name "FOO";
chr1 AUGUSTUS mRNA 28451 29982 0.29 - . gene_id "g13599.t1"; Parent "g13599"; gene_type "protein_coding"; gene_name "BAR";
chr1 AUGUSTUS CDS 28451 28712 0.43 - 1 Parent "g13599.t1"; gene_type "protein_coding"; gene_id "g2"; gene_name "FOOBAR";
chr1 AUGUSTUS exon 28451 28712 . - . Parent "g13599.t1"; gene_type "protein_coding"; gene_id "g3"; gene_name "FOO2";
chr1 AUGUSTUS CDS 29864 29982 0.29 - 0 Parent "g13599.t1"; gene_type "protein_coding"; gene_id "g4"; gene_name "FOO3";
chr1 AUGUSTUS exon 29864 29982 . - . Parent "g13599.t1"; gene_type "protein_coding"; gene_id "g5"; gene_name "FOO4";
Note the new tag-value pairs, tab separation between fields.
Thank you so much for this extremely detailed response.
I am unfortunately still have trouble with my GTF file format. I revised the input GTF to look like this:
chr1 AUGUSTUS gene 1 1540 0.98 - . gene_id "g5649"; gene_type "protein_coding"; gene_name "g5649";
chr1 AUGUSTUS mRNA 1 1540 0.98 - . gene_id "g5649"; Parent "g5649"; gene_type "protein_coding"; gene_name "g5649";
chr1 AUGUSTUS CDS 1110 1540 0.98 - 0 Parent "g5649.t1"; gene_type "protein_coding"; gene_id "g5649"; gene_name "g5649";
chr1 AUGUSTUS exon 1110 1540 . - . Parent "g5649.t1"; gene_type "protein_coding"; gene_id "g5649"; gene_name "g5649";
The error I am now hitting is:
2022-09-27 10:34:39,010 | tejaas.utils.args | INFO | Running TEJAAS v1.0.1
2022-09-27 10:34:39,012 | tejaas.utils.args | INFO | Method: rr
2022-09-27 10:34:39,013 | tejaas.utils.args | INFO | Null Model: perm
2022-09-27 10:34:39,013 | tejaas.utils.args | INFO | Sigma_beta: 0.1
2022-09-27 10:34:39,013 | tejaas.main | DEBUG | Using 2 cores
2022-09-27 10:34:39,041 | tejaas.iotools.data | DEBUG | Reading expression levels for trans-eQTL discovery
2022-09-27 10:34:39,476 | tejaas.iotools.data | DEBUG | Reading expression levels for target gene discovery
2022-09-27 10:34:40,391 | tejaas.iotools.data | DEBUG | Found 7155 genes of 218 samples
2022-09-27 10:34:40,391 | tejaas.iotools.data | DEBUG | Reading gencode file for gene information
Traceback (most recent call last):
File "/groups/stern/home/hanc/miniconda3/envs/py39/bin/tejaas", line 8, in <module>
sys.exit(main())
File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/main.py", line 73, in main
data.load()
File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/iotools/data.py", line 241, in load
gene_info = readgtf.gencode(self.args.gtf_file, trim=self.args.gxtrim, biotype=self.args.biotype)
File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/iotools/readgtf.py", line 48, in gencode
gene_name = infolist[4].strip().split(' ')[1].replace('"','')
IndexError: list index out of range
I checked that the GTF file is properly tab separated. Attached is the first 1000 lines of the GTF that I am using, as an example.
Thank you again for all your time. I really appreciate it.
The GTF file format is correct. I found that the problem is because the PyPI release version of our package was not using the latest Github changes. We sincerely apologize for your trouble. Hope you will find Tejaas useful to detect trans-eQTLs. If you would need any help in selecting the parameters or running the software, please let us know.
I have now updated the package to the latest Github version. You can update it using:
pip install tejaas --upgrade
and hopefully it will work. If not, please do let us know.
Also, note that the --biotype
flag in the command line requires an argument (protein_coding
or lncRNA
).
The program runs for me now with the update you mentioned!
An additional questions: I don't know if I am misunderstanding the --chrom tag. My understanding is that if I have multiple chromosomes in the vcf (each chromosome designated as an integer in the CHROM column of the vcf), then I can use this parameter to designate which chromosome I would choose SNPs from (ie --chrom 2 --include-SNPs 0:100 would mean first 100 SNPs from chr2). However, regardless of what integer I put for the --chrom tag, it defaults to picking the SNPs from the first chromosome in the vcf. I could even put a number that doesn't exist as a chromosome in the vcf and it would still run and default to SNPs that are on the first chromosome.
I have been getting around this problem by pre-subsetting the vcf to each only contain one chromosome. I see in the parallelization example it says "file_pathhere${CHRM}.vcf.gz", so perhaps subsetting the vcf into one chromosome per file is what I am meant to do? Just wondering if I am misunderstanding this, since this is a required parameter.
Hello,
I am hitting an error with what I assume to be a file format issue. This is the error message that I get:
This is my Tejaas command:
This is the format of my expression file:
My VCF file:
My annotation file:
This GTF was generated from a GFF3 using
rtracklayer
in R, and then I manually addedgene_type "protein_coding";
to the last field.Troubleshooting that I tried: 1) Judging by the solution from the closed issue "GTF read error", I assumed my issue is the GTF file format. I tried to verify that the expression and VCF file formats are acceptable by test running Tejaas with my expression and VCF files and the gencode.v19 GTF that came with the example, but hit the same error as above. 2) I thought perhaps Tejaas does not allow for VCF with missing data, so I removed all the sites with any missing genotypes in the VCF but also hit the same error as above. (Are VCFs with missing genotypes allowed?) 3) I simplified the chromosome names in the annotation file to match those from gencode, like this:
but ended up with a different error:
Now I am not sure which input file is likely causing the issue. I am able to successfully run the example files that came with the download.
Any suggestions would much appreciated. Thank you.