tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data
https://nanocompore.rna.rocks
GNU General Public License v3.0
77 stars 12 forks source link

Request for demo data & getting ValueError: 'read_name' is not in list when running nanocompore eventalign_collapse with nanopolishComp data #212

Closed kwonej0617 closed 1 year ago

kwonej0617 commented 1 year ago

Hi, @tleonardi and @lmulroney! Thank you for providing a great tool!

It is my first time using nanocompore. I got several questions after running nanocompore. I would really appreciate it if you could give me answers to those questions.

Question 1) I wonder if you have any demo data provided for testing if I installed nanocompore eventalign_collapse and nanocompore correctly. I downloaded NanopolishComp data and tried to run nanocompore eventalign_collapse with nanopolishComp/tests/data/nanopolish_reads_samples.tsv. I wanted to check the result of eventalign_collapse with nanopolishComp/tests/output/collapsed_nanopolish.tsv. However, I got an error as follows.

Traceback (most recent call last):
  File "/home/euijin.kwon-umw/miniconda3/envs/nanocompore/bin/nanocompore", line 8, in <module>
    sys.exit(main())
  File "/home/euijin.kwon-umw/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/__main__.py", line 175, in main
    args.func(args)
  File "/home/euijin.kwon-umw/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/__main__.py", line 258, in eventalign_collapse_
main
    e()
  File "/home/euijin.kwon-umw/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/Eventalign_collapse.py", line 131, in __call__
    raise E
  File "/home/euijin.kwon-umw/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/Eventalign_collapse.py", line 108, in __call__
    raise NanocomporeError(tb)
nanocompore.common.NanocomporeError: Traceback (most recent call last):
  File "/home/euijin.kwon-umw/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SuperParser.py", line 58, in __init__
    idx = self.colnames.index(field)
ValueError: 'read_name' is not in list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/euijin.kwon-umw/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/Eventalign_collapse.py", line 149, in __split_r
eads
    n_lines=self.__n_lines) as sp:
  File "/home/euijin.kwon-umw/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SuperParser.py", line 61, in __init__
    raise SuperParserError ("Required field `{}` not found in files `{}`".format(field, fn))
nanocompore.SuperParser.SuperParserError: Required field `read_name` not found in files `/home/euijin.kwon-umw/Euijin/nanopolishComp/tests/data/n
anopolish_reads_samples.tsv

The reason for this error is there's no read_name column in the header of input file. So I tried to run nanopolish to get the file with read_name. However, according to the nanocompore manual, I need sequencing_summary.txt.

nanopolish index -s {sequencing_summary.txt} -d {raw_fast5_dir} {basecalled_fastq}
nanopolish eventalign --reads {basecalled_fastq} --bam {aligned_reads_bam} --genome {transcriptome_fasta} --print-read-names --scale-events --samples > {eventalign_reads_tsv}

Question 2) I know nanopolish index can be run without sequencing summary file but is this essential for nanocompore? Question 3) Also, if I provide it in running nanopolish index, would it make read_name column?
Question 4) I want to reproduce nanocompre data. I saw there are several data (MOLM13, synthetic oligos, yeast) deposited in ENA. Where is the output of those data (m6a site)? I really appreciate your help!

lmulroney commented 1 year ago

@kwonej0617 thank you for your questions and interest in nanocompore.

I'll do my best to go through your questions.

1.) There is a small sample yeast dataset of eventalign collapsed files that you can use for testing sampcomp here https://github.com/tleonardi/nanocompore/tree/master/docs/demo/eventalign_files/yeast

You do not need the sequence_summary.txt file to get the read_names column in the eventalign.tsv file. The sequence_summary.txt file is used to speed up the indexing step, but the read names are derived from the fastq files directly. It is common for an empty read names column error to occur when the nanopolish eventalign parameter --print-read-names is not used, but from your example it appears that you did include it. So it's probably worth trying to rerun the nanopolish step.

2.) The sequencing summary file is not used for nanocompore 3.) The read_name column is only produced with the nanopolish eventalign optional argument --print-read-names with or without a sequencing_summary file. Make sure that you are using all of the following optional nanopolish eventalign arguments --print-read-names --scale-events --samples Nanopolish will run without these three, but they are mandatory for nanocompore eventalign_collapse to run properly. 4.) You can find the MOLM13 results here https://github.com/tleonardi/nanocompore_pipeline/tree/3a1db3b7fba4ec80b04b414b1402e9743821220a/results/nanocompore

I hope this was helpful, Logan

kwonej0617 commented 1 year ago

Thank you for your kind reply and answer to the questions, @lmulroney ! I really appreciate it.

I want to run nanocompore with MOLM13 data as you suggested. From your ENA repository (PRJEB44511, PRJEB35148), I found 4 fastq files for MOLM13 (MOLM13_WT_rep1.fastq.gz, MOLM13_WT_rep2.fastq.gz, MOLM13_METLL3_KD_rep1.fastq.gz and MOLM13_METLL3_KD_rep2.fastq.gz). Those fastq files are associated with the MOL13 results in https://github.com/tleonardi/nanocompore_pipeline/tree/3a1db3b7fba4ec80b04b414b1402e9743821220a/results/nanocompore, right?. To run nanopolish step, I need fast5 files but I can't find them. Could you please share the link if you have already deposited it?

Thank you so much!

lmulroney commented 1 year ago

Hi @kwonej0617, the fast5 files are saved as cram files for saving on ENA. The MOLM13 fast5 files are under the PRJEB35148 accession number as cram files.

Logan

kwonej0617 commented 1 year ago

Hi @lmulroney

Thank you for your reply. I downloaded cram files and used on2cram tool to convert from cram file to fast5 file using the command line below.

ont2cram reverse_converter -i polya-RNA_METTL3_WT_1.cram -o fast5/.

However, I got the messages in my error log file. environments; we do recommend installing miniconda in your /pi space to do so. [E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_1.cram' [E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_1.cram' Reads written: 615,840

I thought it would not be running, but I got fast5 files in my fast5 folder. Do you think I can use the decompressed fast5 files? If you think I have to figure out the index issue, could you please share the way to solve it?

Thank you for your help!

lmulroney commented 1 year ago

Hi @kwonej0617,

You will need to uncompress the cram files back into fast5 files before you can use them. Use the utility tool here to do the conversion.

https://github.com/EGA-archive/ont2cram

All the best, Logan

kwonej0617 commented 1 year ago

Hi @lmulroney Thank you for your reply! Yes. I meant I run ont2cram to convert .cram to fast5. It looks .cram files were successfully decompressed into fast5. However, I got the message in log file. Question 1) Is it okay to ignore the messages?

[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_KD_1.cram'
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_KD_1.cram'
Reads written: 645,565
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_KD_2.cram'
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_KD_2.cram'
Reads written: 2,004,174
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_1.cram'
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_1.cram'
Reads written: 615,840
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_2.cram'
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_2.cram'
Reads written: 1,215,006

Question 2) I want to preprocess the data with nextflow. I installed nextflow and prepared samples.txt as follows.

SampleName      Condition       DataPath
KD1     KD      nanocompore/MOLM13_KD_rep1/fast5
KD2     KD      nanocompore/MOLM13_KD_rep2/fast5
WT1     WT      nanocompore/MOLM13_WT_rep1/fast5
WT2     WT      nanocompore/MOLM13_WT_rep2/fast5

I made a copy of nextflow.config.template and saved it as nextflow.config. I changed the samples parameter, transcriptome_fasta = true, gtf = false, genome_fasta = false and target_trancripts = "nanocompore/reference/gencode.v37lift37.transcripts.fa. It looks like the all parameters in nextflow.config were already addressed so I kept the other parts as you addressed.

params {
  // Path to the sample description file
  samples = "samples.txt"

  // Provide either:
  //   1) transcriptome_fasta: a fasta file for the transcriptome
  // or
  //   1) genome_fasta: fasta file for the genome
  //   2) gtf: annotation of the transcriptome in GTF
  // If you use genome_fasta and gtf you *must*
  // set transcriptome_fasta to false.
  transcriptome_fasta = true

  // Path to the Ensembl GTF file for the reference
  // This is ignored if transcriptome_fasta is provided
  gtf = false

  // Path to the reference genome fasta file
  // This is ignored if transcriptome_fasta is provided
  genome_fasta = false

  // Path to a txt file with a list of transcript ids of
  // interest. Only transcripts in this list will be analysed
  // set to false to analyse the whole transcriptome
  target_trancripts = "nanocompore/reference/gencode.v37lift37.transcripts.fa"

Then, I run nextflow run nanocompore.nf. However, the job was killed and none of the output was generated. I checked nanocompore_pipeline github page and your protocol published in current protocols to get an idea of running nextflow. However, because it is my first time using nextflow I am not sure how to run it. Could you please help me?

I really appreciate your help!

lmulroney commented 1 year ago

Hi @kwonej0617

You need to add the path to the transcriptome reference for transcriptome_fasta, not set it to true. You only need to set transcritome_fasta to false when you're using a genome reference and a gtf file.

Logan

kwonej0617 commented 1 year ago

Hi @lmulroney As you said, I added the path to the transcriptome reference for transcriptome_fasta, transcriptome_fasta = "/home/euijin.kwon-umw/Euijin/nanocompore/reference/gencode.v37lift37.transcripts.fa" and run again.

However, I got this message in the log file. I also moved to the work directory and resumed the process, but I still got an error. Could you please help me with how to solve this problem? Thank you so much for your help!


N E X T F L O W  ~  version 22.10.7
Launching `nanocompore.nf` [serene_payne] DSL1 - revision: 3e75b53fd8
[-        ] process > guppy               -
[-        ] process > pycoQC              -
[-        ] process > index_transcriptome -

executor >  local (1)
[-        ] process > guppy                   [  0%] 0 of 4
[-        ] process > pycoQC                  -
[25/e53f37] process > index_transcriptome (1) [  0%] 0 of 1
[-        ] process > minimap                 -
[-        ] process > eventalign              -
[-        ] process > nanocompore             -
WARN: Operator `choice` is deprecated -- it will be removed in a future release
WARN: The `echo` directive has been deprecated - use to `debug` instead

executor >  local (5)
[3c/cf6981] process > guppy (2)               [  0%] 0 of 4
[-        ] process > pycoQC                  -
[25/e53f37] process > index_transcriptome (1) [  0%] 0 of 1
[-        ] process > minimap                 -
[-        ] process > eventalign              -
[-        ] process > nanocompore             -
WARN: Operator `choice` is deprecated -- it will be removed in a future release
WARN: The `echo` directive has been deprecated - use to `debug` instead
Error executing process > 'index_transcriptome (1)'

Caused by:
  Process `index_transcriptome (1)` terminated with an error exit status (127)

Command executed:

  samtools faidx reference_transcriptome.fa
Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.run: line 88: docker: command not found

Work dir:
  /home/euijin.kwon-umw/Euijin/nanocompore_pipeline/work/25/e53f3771df7c2a499c0290bf819851

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

WARN: Killing running tasks (2)

executor >  local (5)
[3c/cf6981] process > guppy (2)               [100%] 2 of 2, failed: 2
[-        ] process > pycoQC                  -
[25/e53f37] process > index_transcriptome (1) [100%] 1 of 1, failed: 1 ✘
[-        ] process > minimap                 -
[-        ] process > eventalign              -
[-        ] process > nanocompore             -
WARN: Operator `choice` is deprecated -- it will be removed in a future release
WARN: The `echo` directive has been deprecated - use to `debug` instead
Error executing process > 'index_transcriptome (1)'
Caused by:
  Process `index_transcriptome (1)` terminated with an error exit status (127)

Command executed:

  samtools faidx reference_transcriptome.fa

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.run: line 88: docker: command not found

Work dir:
  /home/euijin.kwon-umw/Euijin/nanocompore_pipeline/work/25/e53f3771df7c2a499c0290bf819851

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
lmulroney commented 1 year ago

Hi @kwonej0617,

It looks like the samtool faidx command failed. I'm not entirely sure why it would fail at that step, but here are some possible solutions.

  1. Download the transcriptome.fa.fai file or run the samtools command samtool faidx transcriptome.fa yourself before running the nextflow pipeline
  2. Make sure that you have write permissions in the directory where your saving the data/files
  3. If your running the pipeline on a cluster with a job scheduler, make sure that you include the --profile {pbspro, lsf, or aws} as appropriate for your cluster.
kwonej0617 commented 1 year ago

Hi @lmulroney. Thank you for your kind support. However, I am still getting the same issue. Could you please check my files? Here is my LSF file.

#! /bin/bash
#BSUB -L /bin/bash
#BSUB -J nanocompore
#BSUB -q long
#BSUB -o LSF/out.MOLM13_nextflow
#BSUB -e LSF/err.MOLM13_nextflow
#BSUB -n 16 -W 72:00
#BSUB -R span[hosts=1]
#BSUB -R rusage[mem=6000]
#BSUB -u Euijin.kwon@umassmed.edu

module load java/default
module load conda/init
source activate /home/euijin.kwon-umw/Euijin/env/nextflow
/home/euijin.kwon-umw/Euijin/nextflow run nanocompore.nf -c nextflow.config  --profile lsf

This is my configuration file.

params {
  // Path to the sample description file
  samples = "/home/euijin.kwon-umw/Euijin/nanocompore_pipeline/samples.txt"

  // Provide either:
  //   1) transcriptome_fasta: a fasta file for the transcriptome
  // or 
  //   1) genome_fasta: fasta file for the genome
  //   2) gtf: annotation of the transcriptome in GTF
  // If you use genome_fasta and gtf you *must*
  // set transcriptome_fasta to false.
  transcriptome_fasta = "/home/euijin.kwon-umw/Euijin/nanocompore/reference/gencode.v37lift37.transcripts.fa"

  // Path to the Ensembl GTF file for the reference
  // This is ignored if transcriptome_fasta is provided
  gtf = false

  // Path to the reference genome fasta file
  // This is ignored if transcriptome_fasta is provided
  genome_fasta = false

  // Path to a txt file with a list of transcript ids of 
  // interest. Only transcripts in this list will be analysed
  // set to false to analyse the whole transcriptome
  target_trancripts = false

  // Path to a file with transcript ids to
  // exclude from the analysis
  exclude_trancripts = false

  // If true run pycoQC
  qc=true

  // If true guppy saves basecalled fast5 files
  keep_basecalled_fast5=true

  // If true skip the basecalling step
  // the data paths in the sample file must point to Guppy output dirs
  input_is_basecalled=false

  // If true save to disk the eventalign output files
  keep_eventalign_files = true

  // Path to a folder where to store results
  resultsDir = "/home/euijin.kwon-umw/Euijin/nanocompore_pipeline/nextflow_results"

  // Nanocompore specific options 
  // Refer to Nanocompore documentation for info:
  // https://nanocompore.rna.rocks/
  sequenceContext=2
  pvalue_thr=0.05
  min_cov=1
  downsample_high_cov=5000
  nanocompore_loglevel="debug"

  // This must match one of the conditions in the samples file
  reference_condition = "WT"

  // Parameters for basecalling with Guppy
  flowcell = "FLO-MIN106"
  kit = "SQK-RNA002"
  min_qscore = 7

  // Switch on GPU basecalling
  GPU = "false"
  guppy_runners_per_device = 8
  guppy_chunks_per_runner = 1000
  guppy_chunk_size = 1000

  // Provide the location of the Docker/Singularity images
  // (can be either local or on docker hub)
  // Note that you can provide docker images even if you use
  // Singularity to run the pipeline.
  // Due to ONT licensing we can't provide a docker image with
  // Guppy. Please use the Dockerfile in the repository to build
  // a suitable image.
  if(GPU == "true"){
    guppy_container = "genomicpariscentre/guppy-gpu:4.5.3"
  }
  else{
    guppy_container = "genomicpariscentre/guppy:4.5.3"
  }
  pycoqc_container = "tleonardi/pycoqc:2.5.2"
  genomicstools_container = "tleonardi/genomics_tools:1.0.1"
  minimap2_container = "tleonardi/minimap2:2.17"
  eventalign_container = "tleonardi/nanocompore:v1.0.4"
  nanocompore_container = "tleonardi/nanocompore:v1.0.4"

  FPGA = "false"
}

docker.enabled = true
// singularity.enabled = true
// singularity.autoMounts = true

profiles {
  standard {
    includeConfig 'conf/local.conf'
  }

  lsf {
    includeConfig 'conf/lsf.conf'
  }

  pbspro {
    includeConfig 'conf/pbspro.conf'
  }

  aws {
    includeConfig 'conf/aws_batch.conf'
  }
}

Here is the part in nanocompore.nf where it is related to making fasta index file. (I didn't have any changes from your original code.)

if( params.transcriptome_fasta ) {
 transcriptome_bed = file('NO_FILE')
 process index_transcriptome {
    publishDir "${params.resultsDir}/references/", mode: 'copy'
    container "${params.genomicstools_container}"
    input:
      file 'reference_transcriptome.fa' from transcriptome_fasta_index
    output:
      file "reference_transcriptome.fa.fai" into transcriptome_fai_minimap

    """
    samtools faidx reference_transcriptome.fa
    """
  }

}

Thank you so much.

lmulroney commented 1 year ago

Everything appears to be set up properly to me. I don't know why you're getting a command not found error when trying to run samtools index within the pipeline. I would try to run that command before running the nextflow pipeline to make the transcriptome.fa.fai file and see if you can move on to the next step in the pipeline.