Error with pyroe at alevin-fry indexing step

Khajidu commented 1 year ago

Description of the bug

Hello, I am testing the alevin-fry branch and came across this error message at the indexing step: Error: pyroe failed to return succesfully ExitStatus(unix_wait_status(256))

As seen in our discussion in Slack, error message are not informative (here an excerpt):

Command output:
  found `salmon` in the PATH at /usr/local/bin/salmon
  found `alevin-fry` in the PATH at /usr/local/bin/alevin-fry
  found `pyroe` in the PATH at /usr/local/bin/pyroe

Command error:
  Error: pyroe failed to return succesfully ExitStatus(unix_wait_status(256))

Command used and terminal output

nextflow run nf-core/scrnaseq -resume --max_memory '500.GB' -r alevin-fry -c /shared/home/hmayeur/.nextflow/config --outdir /shared/ifbstor1/projects/bsbii/sc_single_cell_brain/fulltest_alevin/ --protocol '10XV3' --aligner alevin  --input 'test_samples_v2.csv' --genome_fasta /shared/projects/bsbii/sc_single_cell_brain/sc_ncbi_genome.fasta --gtf /shared/projects/bsbii/sc_single_cell_brain/sc_gene_models_ncbi_no_genes_no_contigs_notrnas.gtf -profile ifb_core
N E X T F L O W  ~  version 22.04.0
Launching `https://github.com/nf-core/scrnaseq` [desperate_lumiere] DSL2 - revision: 298233d17d [alevin-fry]

------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/scrnaseq v2.0.1dev
------------------------------------------------------
Core Nextflow options
  revision                  : alevin-fry
  runName                   : desperate_lumiere
  containerEngine           : singularity
  launchDir                 : /shared/ifbstor1/projects/bsbii/sc_single_cell_brain
  workDir                   : /shared/ifbstor1/projects/bsbii/sc_single_cell_brain/work
  projectDir                : /shared/home/hmayeur/.nextflow/assets/nf-core/scrnaseq
  userName                  : hmayeur
  profile                   : ifb_core
  configFiles               : /shared/home/hmayeur/.nextflow/config, /shared/home/hmayeur/.nextflow/assets/nf-core/scrnaseq/nextflow.config, /shared/home/hmayeur/.nextflow/config

Input/output options
  input                     : test_samples_v2.csv
  outdir                    : /shared/ifbstor1/projects/bsbii/sc_single_cell_brain/fulltest_alevin/

Mandatory arguments
  protocol                  : 10XV3

Reference genome options
  genome_fasta              : /shared/projects/bsbii/sc_single_cell_brain/sc_ncbi_genome.fasta
  igenomes_ignore           : true
  gtf                       : /shared/projects/bsbii/sc_single_cell_brain/sc_gene_models_ncbi_no_genes_no_contigs_notrnas.gtf

Kallisto/BUS Options
  bustools_correct          : true

Institutional config options
  config_profile_description: The IFB core cluster profile
  config_profile_contact    : https://community.france-bioinformatique.fr
  config_profile_url        : https://ifb-elixirfr.gitlab.io/cluster/doc/cluster-desc/

Max job request options
  max_cpus                  : 56
  max_memory                : 500.GB
  max_time                  : 24d 20h 31m 24s

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/scrnaseq for your analysis please cite:

* The pipeline
  https://doi.org/10.5281/zenodo.3568187

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/scrnaseq/blob/master/CITATIONS.md
------------------------------------------------------
executor >  slurm (2)
[38/623f6b] process > NFCORE_SCRNASEQ:SCRNASEQ:INPUT_CHECK:SAMPLESHEET_CHECK (test_samples_v2.csv)                                  [100%] 1 of 1 ✔
[62/e77da4] process > NFCORE_SCRNASEQ:SCRNASEQ:FASTQC_CHECK:FASTQC (S3_002)                                                         [100%] 8 of 8, cached: 8 ✔
[41/5bec30] process > NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:SIMPLEAF_INDEX (sc_gene_models_ncbi_no_genes_no_contigs_notrnas.gtf) [  0%] 0 of 1
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:SIMPLEAF_QUANT                                                       -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:ALEVINQC                                                             -
executor >  slurm (2)
[38/623f6b] process > NFCORE_SCRNASEQ:SCRNASEQ:INPUT_CHECK:SAMPLESHEET_CHECK (test_samples_v2.csv)                                  [100%] 1 of 1 ✔
[62/e77da4] process > NFCORE_SCRNASEQ:SCRNASEQ:FASTQC_CHECK:FASTQC (S3_002)                                                         [100%] 8 of 8, cached: 8 ✔
[41/5bec30] process > NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:SIMPLEAF_INDEX (sc_gene_models_ncbi_no_genes_no_contigs_notrnas.gtf) [100%] 1 of 1, failed: 1 ✘
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:SIMPLEAF_QUANT                                                       -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:ALEVINQC                                                             -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD                                                           -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:CONCAT_H5AD                                                           -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT                                                         -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:CUSTOM_DUMPSOFTWAREVERSIONS                                                          -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:MULTIQC                                                                              -
Execution cancelled -- Finishing pending tasks before exit
executor >  slurm (2)
[38/623f6b] process > NFCORE_SCRNASEQ:SCRNASEQ:INPUT_CHECK:SAMPLESHEET_CHECK (test_samples_v2.csv)                                  [100%] 1 of 1 ✔
[62/e77da4] process > NFCORE_SCRNASEQ:SCRNASEQ:FASTQC_CHECK:FASTQC (S3_002)                                                         [100%] 8 of 8, cached: 8 ✔
[41/5bec30] process > NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:SIMPLEAF_INDEX (sc_gene_models_ncbi_no_genes_no_contigs_notrnas.gtf) [100%] 1 of 1, failed: 1 ✘
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:SIMPLEAF_QUANT                                                       -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:ALEVINQC                                                             -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD                                                           -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:CONCAT_H5AD                                                           -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT                                                         -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:CUSTOM_DUMPSOFTWAREVERSIONS                                                          -
[-        ] process > NFCORE_SCRNASEQ:SCRNASEQ:MULTIQC                                                                              -
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/scrnaseq] Pipeline completed with errors-
Pulling Singularity image https://depot.galaxyproject.org/singularity/multiqc:1.13--pyhdfd78af_0 [cache /shared/ifbstor1/projects/bsbii/sc_single_cell_brain/work/singularity/depot.galaxyproject.org-singularity-multiqc-1.13--pyhdfd78af_0.img]
WARN: Singularity cache directory has not been defined -- Remote image will be stored in the path: /shared/ifbstor1/projects/bsbii/sc_single_cell_brain/work/singularity -- Use env variable NXF_SINGULARITY_CACHEDIR to specify a different location
Error executing process > 'NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:SIMPLEAF_INDEX (sc_gene_models_ncbi_no_genes_no_contigs_notrnas.gtf)'

Caused by:
  Process `NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:SIMPLEAF_INDEX (sc_gene_models_ncbi_no_genes_no_contigs_notrnas.gtf)` terminated with an error exit status (1)

Command executed:

  # export required var
  export ALEVIN_FRY_HOME=.

  # prep simpleaf
  simpleaf set-paths

  # run simpleaf index
  simpleaf \
      index \
      --threads 6 \
      --fasta sc_ncbi_genome.fasta \
      --gtf sc_gene_models_ncbi_no_genes_no_contigs_notrnas.gtf \
      --rlen 91 \
      -o salmon

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SCRNASEQ:SCRNASEQ:SCRNASEQ_ALEVIN:SIMPLEAF_INDEX":
      simpleaf: $(simpleaf -V | tr -d '\n' | cut -d ' ' -f 2)
      salmon: $(salmon --version | sed -e "s/salmon //g")
  END_VERSIONS

Command exit status:
  1

Command output:
  found `salmon` in the PATH at /usr/local/bin/salmon
  found `alevin-fry` in the PATH at /usr/local/bin/alevin-fry
  found `pyroe` in the PATH at /usr/local/bin/pyroe

Command error:
  Error: pyroe failed to return succesfully ExitStatus(unix_wait_status(256))

Work dir:
  /shared/ifbstor1/projects/bsbii/sc_single_cell_brain/work/41/5bec30428f0246f7ea1b1329ac3feb

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Relevant files

.command.log .nextflow.log

System information

Nextflow version: 22.04.0 Hardware: HPC Executor: Slurm Container: Singularity OS: CentOS Version of nf-core/scrnaseq: dev

rob-p commented 1 year ago

Is it possible to have access to the genome and GTF file being used here? I realize the reads may be (I believe, is) currently private, but the genome and annotation should be all we need to reproduce this error.

Khajidu commented 1 year ago

Transferring ongoing, file is about 500K, pipeline also gives the same error message with the original NCBI GTF (genome is also publicly available on NCBI) Edit: cannot transfer GTF (unsupported type), all files are available on NCBI (Scyliorhinus canicula genome + GFF/GTF)

rob-p commented 1 year ago

I was able to dig a bit further upstream, and the error occurs when pyranges is trying to parse the GTF file. The error trace looks like this.

❯ pyroe make-splici --dedup-seqs GCF_902713615.1_sScyCan1.1_genomic.fna GCF_902713615.1_sScyCan1.1_genomic.gtf 100 sScyCan1_index
Traceback (most recent call last):
  File "/home/rob/.local/bin/pyroe", line 195, in <module>
    make_splici_txome(
  File "/home/rob/.local/lib/python3.9/site-packages/pyroe/make_splici_txome.py", line 255, in make_splici_txome
    gr = pr.read_gtf(gtf_path)
  File "/home/rob/.local/lib/python3.9/site-packages/pyranges/readers.py", line 311, in read_gtf
    gr = read_gtf_full(f, as_df, nrows, _skiprows, duplicate_attr)
  File "/home/rob/.local/lib/python3.9/site-packages/pyranges/readers.py", line 343, in read_gtf_full
    extra = _to_rows(df.Attribute)
  File "/home/rob/.local/lib/python3.9/site-packages/pyranges/readers.py", line 367, in to_rows
    rowdicts.append({k: v
  File "/home/rob/.local/lib/python3.9/site-packages/pyranges/readers.py", line 368, in <dictcomp>
    for k, v in [kv.replace('"', '').split(None, 1)
ValueError: not enough values to unpack (expected 2, got 1)

rob-p commented 1 year ago

As an update, I attempted to see if this was the result of a "poorly-formed" GTF file. I instead downloaded the GFF file and used gffread to convert it to a GTF file. Processing got much further, but terminated with a non-zero return code from bedtools. The error was:

Differing number of BED fields encountered at line: 7.  Exiting...

Indeed we can see this is the case:

NC_001950.1 5257    5941    rna-ATP6    .   +   S   gene-ATP6
NC_001950.1 5099    5267    rna-ATP8    .   +   S   gene-ATP8
NC_001950.1 2626    4180    rna-COX1    .   +   S   gene-COX1
NC_001950.1 4333    5024    rna-COX2    .   +   S   gene-COX2
NC_001950.1 5940    6726    rna-COX3    .   +   S   gene-COX3
NC_001950.1 11516   12660   rna-CYTB    .   +   S   gene-CYTB
NC_001950.1 1122    1192    rna-NC_001950.1:1123..1192  .   +   S
NC_001950.1 12660   12732   rna-NC_001950.1:12661..12732    .   +   S

There is a field missing from the records starting at line 7. I believe this is a result of these records not having an associated gene name. I think this may be a pyroe issue and in such cases we will want to just e.g. give the transcript a gene name equal to its transcript name. Of course, it doesn't suggest what is wrong with the original GTF file itself.

apeltzer commented 1 year ago

Thank you for digging further into this @rob-p ! So its technically something upstream that we cannot fix on a pipeline level I assume? Not sure if we want to open the can of worms to add functionality that does certain checks on the GFF/GTF provided, as this could also give users a certain "false sense of security" by making the workflow run but potentially also not doing the right thing. Not sure what we can / should do here on a pipeline level 🤔

rob-p commented 1 year ago

Indeed. @dongzehe is looking into this upstream. If it's something we can reliably fix then we'll probably bump pyroe and we can include the new version here. Otherwise, we'll just have to find a way to fix the GTF externally to process such samples.

rob-p commented 1 year ago

There were several issues at play here. The first was that the input GTF file cannot be parsed by pyranges at all. That is a problem upstream in pyranges and I'm not sure how to fix it. I think the best thing is to report it in that repo with a link to the offending GTF and let them figure out if (a) it's a non-conformance of their parser, or (b) it's a fundamental problem with he GTF file itself. The second issue is that, even after that is resolved, there are several records with no gene_id. Finally, there were some transcript records with no gene_id nor gene_name. We solved the first case by copying the gene_name as the gene_id if the latter is not present. We solved the second issue by using the transcript_id as the gene_id if neither a gene_id no gene_name is present.

@dongzehe and I have made a version of the GTF file that fixes all of the issues. That file can be obtained from here.

Here is how the cleaned GTF was arrived at

The following command was used to clean the GTF file (since pyranges would not parse it).

gffread -o new_gtf.gff GCF_902713615.1_sScyCan1.1_genomic.gff --gtf -E --keep-genes

This creates a parsable GTF file, but there are 2 problems. First, gffread adds gene- in front of the gene ID for no obvious reason, which just makes things more complicated. We fix that with the following command:

rg --passthru -N “\”gene-” -r “\”" new_gtf.gtf  > new_gtf.fixed.gtf

Finally, there are a small number ~24 records of transcripts that have no associated gene ID. So we have to augment those records with a corresponding gene_id so that they can be processed. This was performed using the newest version of pyroe , which lives in the development branch of the repository and should be released as 0.6.3 soon.

DongzeHE commented 1 year ago

Hi all,

Given the new_gtf.fixed.gtf file from the rg command above, the clean_gtf.gtf file @rob-p uploaded is from the following process:

If using the development branch of the repository (clone it and run pip install .), the clean_gtf.gtf can be obtained by running

pyroe make-splici GCF_902713615.1_sScyCan1.1_genomic.fna new_gtf.fixed.gtf 100 output_dir --write-clean-gtf

With the new_gtf.fixed.gtf file and pyroe 0.6.3, the splici transcriptome can be made with some warnings saying there are missing gene_id and gene_name values.
The --write-clean-gtf flag tells pyroe to export the imputed clean GTF file to the output_dir/clean_gtf.gtf, in which all missing gene_id and gene_name are imputed.
Here 100 in the command represents the read length. As we were not sure what the read length is, we use 100. It doesn't affect the clean_gtf.gtf. Feel free to change it.
Dongze

Khajidu commented 1 year ago

After a few tests, I've come to the conclusion that it works on public GFFs. I'm now trying to see if this works on GFFs made from custom transcriptomic references such as mine. Thanks again!

Khajidu commented 1 year ago

Could do the first 2 steps on a custom GFF3 made from the genome and my transcriptomic reference (with GMAP 2020.06.01) but the last step (with dev version of pyroe) gave me a few files, but no GTF. The GTF I obtained from the previous step did not work for alevin at the indexing step with a --txp2gene parameter.

Khajidu commented 1 year ago

testing without --txp2gene and it's working smoothly so far (indexing worked this time!)

rob-p commented 1 year ago

Tracking this and just wanted to mention here that the changes we made in pyroe are now represented in version 0.6.3 which is on bioconda as well. If things work cleanly with this version, we can bump the version dependency to the published version.

lfperales commented 1 year ago

@rob-p how can we fix this using nextflow tower? thank you

grst commented 1 year ago

Can this issue be closed?

apeltzer commented 1 year ago

Yes, bugfix is on dev now

nf-core / scrnaseq