Closed Khajidu closed 1 year ago
Is it possible to have access to the genome and GTF file being used here? I realize the reads may be (I believe, is) currently private, but the genome and annotation should be all we need to reproduce this error.
Transferring ongoing, file is about 500K, pipeline also gives the same error message with the original NCBI GTF (genome is also publicly available on NCBI) Edit: cannot transfer GTF (unsupported type), all files are available on NCBI (Scyliorhinus canicula genome + GFF/GTF)
I was able to dig a bit further upstream, and the error occurs when pyranges
is trying to parse the GTF file. The error trace looks like this.
❯ pyroe make-splici --dedup-seqs GCF_902713615.1_sScyCan1.1_genomic.fna GCF_902713615.1_sScyCan1.1_genomic.gtf 100 sScyCan1_index
Traceback (most recent call last):
File "/home/rob/.local/bin/pyroe", line 195, in <module>
make_splici_txome(
File "/home/rob/.local/lib/python3.9/site-packages/pyroe/make_splici_txome.py", line 255, in make_splici_txome
gr = pr.read_gtf(gtf_path)
File "/home/rob/.local/lib/python3.9/site-packages/pyranges/readers.py", line 311, in read_gtf
gr = read_gtf_full(f, as_df, nrows, _skiprows, duplicate_attr)
File "/home/rob/.local/lib/python3.9/site-packages/pyranges/readers.py", line 343, in read_gtf_full
extra = _to_rows(df.Attribute)
File "/home/rob/.local/lib/python3.9/site-packages/pyranges/readers.py", line 367, in to_rows
rowdicts.append({k: v
File "/home/rob/.local/lib/python3.9/site-packages/pyranges/readers.py", line 368, in <dictcomp>
for k, v in [kv.replace('"', '').split(None, 1)
ValueError: not enough values to unpack (expected 2, got 1)
As an update, I attempted to see if this was the result of a "poorly-formed" GTF file. I instead downloaded the GFF file and used gffread
to convert it to a GTF file. Processing got much further, but terminated with a non-zero return code from bedtools
. The error was:
Differing number of BED fields encountered at line: 7. Exiting...
Indeed we can see this is the case:
NC_001950.1 5257 5941 rna-ATP6 . + S gene-ATP6
NC_001950.1 5099 5267 rna-ATP8 . + S gene-ATP8
NC_001950.1 2626 4180 rna-COX1 . + S gene-COX1
NC_001950.1 4333 5024 rna-COX2 . + S gene-COX2
NC_001950.1 5940 6726 rna-COX3 . + S gene-COX3
NC_001950.1 11516 12660 rna-CYTB . + S gene-CYTB
NC_001950.1 1122 1192 rna-NC_001950.1:1123..1192 . + S
NC_001950.1 12660 12732 rna-NC_001950.1:12661..12732 . + S
There is a field missing from the records starting at line 7. I believe this is a result of these records not having an associated gene name. I think this may be a pyroe
issue and in such cases we will want to just e.g. give the transcript a gene name equal to its transcript name. Of course, it doesn't suggest what is wrong with the original GTF file itself.
Thank you for digging further into this @rob-p ! So its technically something upstream that we cannot fix on a pipeline level I assume? Not sure if we want to open the can of worms to add functionality that does certain checks on the GFF/GTF provided, as this could also give users a certain "false sense of security" by making the workflow run but potentially also not doing the right thing. Not sure what we can / should do here on a pipeline level 🤔
Indeed. @dongzehe is looking into this upstream. If it's something we can reliably fix then we'll probably bump pyroe and we can include the new version here. Otherwise, we'll just have to find a way to fix the GTF externally to process such samples.
There were several issues at play here. The first was that the input GTF file cannot be parsed by pyranges
at all. That is a problem upstream in pyranges
and I'm not sure how to fix it. I think the best thing is to report it in that repo with a link to the offending GTF and let them figure out if (a) it's a non-conformance of their parser, or (b) it's a fundamental problem with he GTF file itself. The second issue is that, even after that is resolved, there are several records with no gene_id
. Finally, there were some transcript records with no gene_id
nor gene_name
. We solved the first case by copying the gene_name
as the gene_id
if the latter is not present. We solved the second issue by using the transcript_id
as the gene_id
if neither a gene_id
no gene_name
is present.
@dongzehe and I have made a version of the GTF file that fixes all of the issues. That file can be obtained from here.
Here is how the cleaned GTF was arrived at
The following command was used to clean the GTF file (since pyranges
would not parse it).
gffread -o new_gtf.gff GCF_902713615.1_sScyCan1.1_genomic.gff --gtf -E --keep-genes
This creates a parsable GTF file, but there are 2 problems. First, gffread
adds gene-
in front of the gene ID for no obvious reason, which just makes things more complicated. We fix that with the following command:
rg --passthru -N “\”gene-” -r “\”" new_gtf.gtf > new_gtf.fixed.gtf
Finally, there are a small number ~24 records of transcripts that have no associated gene ID. So we have to augment those records with a corresponding gene_id
so that they can be processed. This was performed using the newest version of pyroe
, which lives in the development branch of the repository and should be released as 0.6.3
soon.
Hi all,
Given the new_gtf.fixed.gtf
file from the rg
command above, the clean_gtf.gtf
file @rob-p uploaded is from the following process:
If using the development branch of the repository (clone it and run pip install .
), the clean_gtf.gtf
can be obtained by running
pyroe make-splici GCF_902713615.1_sScyCan1.1_genomic.fna new_gtf.fixed.gtf 100 output_dir --write-clean-gtf
With the new_gtf.fixed.gtf
file and pyroe 0.6.3, the splici transcriptome can be made with some warnings saying there are missing gene_id
and gene_name
values.
The --write-clean-gtf
flag tells pyroe to export the imputed clean GTF file to the output_dir/clean_gtf.gtf
, in which all missing gene_id
and gene_name
are imputed.
Here 100
in the command represents the read length. As we were not sure what the read length is, we use 100. It doesn't affect the clean_gtf.gtf
. Feel free to change it.
Dongze
After a few tests, I've come to the conclusion that it works on public GFFs. I'm now trying to see if this works on GFFs made from custom transcriptomic references such as mine. Thanks again!
Could do the first 2 steps on a custom GFF3 made from the genome and my transcriptomic reference (with GMAP 2020.06.01) but the last step (with dev version of pyroe) gave me a few files, but no GTF. The GTF I obtained from the previous step did not work for alevin at the indexing step with a --txp2gene parameter.
testing without --txp2gene and it's working smoothly so far (indexing worked this time!)
Tracking this and just wanted to mention here that the changes we made in pyroe
are now represented in version 0.6.3 which is on bioconda as well. If things work cleanly with this version, we can bump the version dependency to the published version.
@rob-p how can we fix this using nextflow tower? thank you
Can this issue be closed?
Yes, bugfix is on dev now
Description of the bug
Hello, I am testing the alevin-fry branch and came across this error message at the indexing step:
Error: pyroe failed to return succesfully ExitStatus(unix_wait_status(256))
As seen in our discussion in Slack, error message are not informative (here an excerpt):
Command used and terminal output
Relevant files
.command.log .nextflow.log
System information
Nextflow version: 22.04.0 Hardware: HPC Executor: Slurm Container: Singularity OS: CentOS Version of nf-core/scrnaseq: dev