Closed kwonej0617 closed 1 year ago
@kwonej0617 thank you for your questions and interest in nanocompore.
I'll do my best to go through your questions.
1.) There is a small sample yeast dataset of eventalign collapsed files that you can use for testing sampcomp here https://github.com/tleonardi/nanocompore/tree/master/docs/demo/eventalign_files/yeast
You do not need the sequence_summary.txt file to get the read_names column in the eventalign.tsv file. The sequence_summary.txt file is used to speed up the indexing step, but the read names are derived from the fastq files directly. It is common for an empty read names column error to occur when the nanopolish eventalign parameter --print-read-names is not used, but from your example it appears that you did include it. So it's probably worth trying to rerun the nanopolish step.
2.) The sequencing summary file is not used for nanocompore 3.) The read_name column is only produced with the nanopolish eventalign optional argument --print-read-names with or without a sequencing_summary file. Make sure that you are using all of the following optional nanopolish eventalign arguments --print-read-names --scale-events --samples Nanopolish will run without these three, but they are mandatory for nanocompore eventalign_collapse to run properly. 4.) You can find the MOLM13 results here https://github.com/tleonardi/nanocompore_pipeline/tree/3a1db3b7fba4ec80b04b414b1402e9743821220a/results/nanocompore
I hope this was helpful, Logan
Thank you for your kind reply and answer to the questions, @lmulroney ! I really appreciate it.
I want to run nanocompore with MOLM13 data as you suggested. From your ENA repository (PRJEB44511, PRJEB35148), I found 4 fastq files for MOLM13 (MOLM13_WT_rep1.fastq.gz, MOLM13_WT_rep2.fastq.gz, MOLM13_METLL3_KD_rep1.fastq.gz and MOLM13_METLL3_KD_rep2.fastq.gz). Those fastq files are associated with the MOL13 results in https://github.com/tleonardi/nanocompore_pipeline/tree/3a1db3b7fba4ec80b04b414b1402e9743821220a/results/nanocompore, right?. To run nanopolish step, I need fast5 files but I can't find them. Could you please share the link if you have already deposited it?
Thank you so much!
Hi @kwonej0617, the fast5 files are saved as cram files for saving on ENA. The MOLM13 fast5 files are under the PRJEB35148 accession number as cram files.
Logan
Hi @lmulroney
Thank you for your reply. I downloaded cram files and used on2cram tool to convert from cram file to fast5 file using the command line below.
ont2cram reverse_converter -i polya-RNA_METTL3_WT_1.cram -o fast5/.
However, I got the messages in my error log file. environments; we do recommend installing miniconda in your /pi space to do so. [E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_1.cram' [E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_1.cram' Reads written: 615,840
I thought it would not be running, but I got fast5 files in my fast5 folder. Do you think I can use the decompressed fast5 files? If you think I have to figure out the index issue, could you please share the way to solve it?
Thank you for your help!
Hi @kwonej0617,
You will need to uncompress the cram files back into fast5 files before you can use them. Use the utility tool here to do the conversion.
https://github.com/EGA-archive/ont2cram
All the best, Logan
Hi @lmulroney Thank you for your reply! Yes. I meant I run ont2cram to convert .cram to fast5. It looks .cram files were successfully decompressed into fast5. However, I got the message in log file. Question 1) Is it okay to ignore the messages?
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_KD_1.cram'
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_KD_1.cram'
Reads written: 645,565
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_KD_2.cram'
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_KD_2.cram'
Reads written: 2,004,174
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_1.cram'
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_1.cram'
Reads written: 615,840
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_2.cram'
[E::cram_index_load] Could not retrieve index file for 'polya-RNA_METTL3_WT_2.cram'
Reads written: 1,215,006
Question 2) I want to preprocess the data with nextflow. I installed nextflow and prepared samples.txt as follows.
SampleName Condition DataPath
KD1 KD nanocompore/MOLM13_KD_rep1/fast5
KD2 KD nanocompore/MOLM13_KD_rep2/fast5
WT1 WT nanocompore/MOLM13_WT_rep1/fast5
WT2 WT nanocompore/MOLM13_WT_rep2/fast5
I made a copy of nextflow.config.template and saved it as nextflow.config. I changed the samples parameter, transcriptome_fasta = true, gtf = false, genome_fasta = false and target_trancripts = "nanocompore/reference/gencode.v37lift37.transcripts.fa. It looks like the all parameters in nextflow.config were already addressed so I kept the other parts as you addressed.
params {
// Path to the sample description file
samples = "samples.txt"
// Provide either:
// 1) transcriptome_fasta: a fasta file for the transcriptome
// or
// 1) genome_fasta: fasta file for the genome
// 2) gtf: annotation of the transcriptome in GTF
// If you use genome_fasta and gtf you *must*
// set transcriptome_fasta to false.
transcriptome_fasta = true
// Path to the Ensembl GTF file for the reference
// This is ignored if transcriptome_fasta is provided
gtf = false
// Path to the reference genome fasta file
// This is ignored if transcriptome_fasta is provided
genome_fasta = false
// Path to a txt file with a list of transcript ids of
// interest. Only transcripts in this list will be analysed
// set to false to analyse the whole transcriptome
target_trancripts = "nanocompore/reference/gencode.v37lift37.transcripts.fa"
Then, I run nextflow run nanocompore.nf
.
However, the job was killed and none of the output was generated. I checked nanocompore_pipeline github page and your protocol published in current protocols to get an idea of running nextflow. However, because it is my first time using nextflow I am not sure how to run it. Could you please help me?
I really appreciate your help!
Hi @kwonej0617
You need to add the path to the transcriptome reference for transcriptome_fasta, not set it to true. You only need to set transcritome_fasta to false when you're using a genome reference and a gtf file.
Logan
Hi @lmulroney
As you said, I added the path to the transcriptome reference for transcriptome_fasta, transcriptome_fasta = "/home/euijin.kwon-umw/Euijin/nanocompore/reference/gencode.v37lift37.transcripts.fa"
and run again.
However, I got this message in the log file. I also moved to the work directory and resumed the process, but I still got an error. Could you please help me with how to solve this problem? Thank you so much for your help!
N E X T F L O W ~ version 22.10.7
Launching `nanocompore.nf` [serene_payne] DSL1 - revision: 3e75b53fd8
[- ] process > guppy -
[- ] process > pycoQC -
[- ] process > index_transcriptome -
executor > local (1)
[- ] process > guppy [ 0%] 0 of 4
[- ] process > pycoQC -
[25/e53f37] process > index_transcriptome (1) [ 0%] 0 of 1
[- ] process > minimap -
[- ] process > eventalign -
[- ] process > nanocompore -
WARN: Operator `choice` is deprecated -- it will be removed in a future release
WARN: The `echo` directive has been deprecated - use to `debug` instead
executor > local (5)
[3c/cf6981] process > guppy (2) [ 0%] 0 of 4
[- ] process > pycoQC -
[25/e53f37] process > index_transcriptome (1) [ 0%] 0 of 1
[- ] process > minimap -
[- ] process > eventalign -
[- ] process > nanocompore -
WARN: Operator `choice` is deprecated -- it will be removed in a future release
WARN: The `echo` directive has been deprecated - use to `debug` instead
Error executing process > 'index_transcriptome (1)'
Caused by:
Process `index_transcriptome (1)` terminated with an error exit status (127)
Command executed:
samtools faidx reference_transcriptome.fa
Command exit status:
127
Command output:
(empty)
Command error:
.command.run: line 88: docker: command not found
Work dir:
/home/euijin.kwon-umw/Euijin/nanocompore_pipeline/work/25/e53f3771df7c2a499c0290bf819851
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
WARN: Killing running tasks (2)
executor > local (5)
[3c/cf6981] process > guppy (2) [100%] 2 of 2, failed: 2
[- ] process > pycoQC -
[25/e53f37] process > index_transcriptome (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > minimap -
[- ] process > eventalign -
[- ] process > nanocompore -
WARN: Operator `choice` is deprecated -- it will be removed in a future release
WARN: The `echo` directive has been deprecated - use to `debug` instead
Error executing process > 'index_transcriptome (1)'
Caused by:
Process `index_transcriptome (1)` terminated with an error exit status (127)
Command executed:
samtools faidx reference_transcriptome.fa
Command exit status:
127
Command output:
(empty)
Command error:
.command.run: line 88: docker: command not found
Work dir:
/home/euijin.kwon-umw/Euijin/nanocompore_pipeline/work/25/e53f3771df7c2a499c0290bf819851
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Hi @kwonej0617,
It looks like the samtool faidx command failed. I'm not entirely sure why it would fail at that step, but here are some possible solutions.
Hi @lmulroney. Thank you for your kind support. However, I am still getting the same issue. Could you please check my files? Here is my LSF file.
#! /bin/bash
#BSUB -L /bin/bash
#BSUB -J nanocompore
#BSUB -q long
#BSUB -o LSF/out.MOLM13_nextflow
#BSUB -e LSF/err.MOLM13_nextflow
#BSUB -n 16 -W 72:00
#BSUB -R span[hosts=1]
#BSUB -R rusage[mem=6000]
#BSUB -u Euijin.kwon@umassmed.edu
module load java/default
module load conda/init
source activate /home/euijin.kwon-umw/Euijin/env/nextflow
/home/euijin.kwon-umw/Euijin/nextflow run nanocompore.nf -c nextflow.config --profile lsf
This is my configuration file.
params {
// Path to the sample description file
samples = "/home/euijin.kwon-umw/Euijin/nanocompore_pipeline/samples.txt"
// Provide either:
// 1) transcriptome_fasta: a fasta file for the transcriptome
// or
// 1) genome_fasta: fasta file for the genome
// 2) gtf: annotation of the transcriptome in GTF
// If you use genome_fasta and gtf you *must*
// set transcriptome_fasta to false.
transcriptome_fasta = "/home/euijin.kwon-umw/Euijin/nanocompore/reference/gencode.v37lift37.transcripts.fa"
// Path to the Ensembl GTF file for the reference
// This is ignored if transcriptome_fasta is provided
gtf = false
// Path to the reference genome fasta file
// This is ignored if transcriptome_fasta is provided
genome_fasta = false
// Path to a txt file with a list of transcript ids of
// interest. Only transcripts in this list will be analysed
// set to false to analyse the whole transcriptome
target_trancripts = false
// Path to a file with transcript ids to
// exclude from the analysis
exclude_trancripts = false
// If true run pycoQC
qc=true
// If true guppy saves basecalled fast5 files
keep_basecalled_fast5=true
// If true skip the basecalling step
// the data paths in the sample file must point to Guppy output dirs
input_is_basecalled=false
// If true save to disk the eventalign output files
keep_eventalign_files = true
// Path to a folder where to store results
resultsDir = "/home/euijin.kwon-umw/Euijin/nanocompore_pipeline/nextflow_results"
// Nanocompore specific options
// Refer to Nanocompore documentation for info:
// https://nanocompore.rna.rocks/
sequenceContext=2
pvalue_thr=0.05
min_cov=1
downsample_high_cov=5000
nanocompore_loglevel="debug"
// This must match one of the conditions in the samples file
reference_condition = "WT"
// Parameters for basecalling with Guppy
flowcell = "FLO-MIN106"
kit = "SQK-RNA002"
min_qscore = 7
// Switch on GPU basecalling
GPU = "false"
guppy_runners_per_device = 8
guppy_chunks_per_runner = 1000
guppy_chunk_size = 1000
// Provide the location of the Docker/Singularity images
// (can be either local or on docker hub)
// Note that you can provide docker images even if you use
// Singularity to run the pipeline.
// Due to ONT licensing we can't provide a docker image with
// Guppy. Please use the Dockerfile in the repository to build
// a suitable image.
if(GPU == "true"){
guppy_container = "genomicpariscentre/guppy-gpu:4.5.3"
}
else{
guppy_container = "genomicpariscentre/guppy:4.5.3"
}
pycoqc_container = "tleonardi/pycoqc:2.5.2"
genomicstools_container = "tleonardi/genomics_tools:1.0.1"
minimap2_container = "tleonardi/minimap2:2.17"
eventalign_container = "tleonardi/nanocompore:v1.0.4"
nanocompore_container = "tleonardi/nanocompore:v1.0.4"
FPGA = "false"
}
docker.enabled = true
// singularity.enabled = true
// singularity.autoMounts = true
profiles {
standard {
includeConfig 'conf/local.conf'
}
lsf {
includeConfig 'conf/lsf.conf'
}
pbspro {
includeConfig 'conf/pbspro.conf'
}
aws {
includeConfig 'conf/aws_batch.conf'
}
}
Here is the part in nanocompore.nf
where it is related to making fasta index file. (I didn't have any changes from your original code.)
if( params.transcriptome_fasta ) {
transcriptome_bed = file('NO_FILE')
process index_transcriptome {
publishDir "${params.resultsDir}/references/", mode: 'copy'
container "${params.genomicstools_container}"
input:
file 'reference_transcriptome.fa' from transcriptome_fasta_index
output:
file "reference_transcriptome.fa.fai" into transcriptome_fai_minimap
"""
samtools faidx reference_transcriptome.fa
"""
}
}
Thank you so much.
Everything appears to be set up properly to me. I don't know why you're getting a command not found error when trying to run samtools index within the pipeline. I would try to run that command before running the nextflow pipeline to make the transcriptome.fa.fai file and see if you can move on to the next step in the pipeline.
Hi, @tleonardi and @lmulroney! Thank you for providing a great tool!
It is my first time using nanocompore. I got several questions after running nanocompore. I would really appreciate it if you could give me answers to those questions.
Question 1) I wonder if you have any demo data provided for testing if I installed nanocompore eventalign_collapse and nanocompore correctly. I downloaded NanopolishComp data and tried to run nanocompore eventalign_collapse with nanopolishComp/tests/data/nanopolish_reads_samples.tsv. I wanted to check the result of eventalign_collapse with nanopolishComp/tests/output/collapsed_nanopolish.tsv. However, I got an error as follows.
The reason for this error is there's no read_name column in the header of input file. So I tried to run nanopolish to get the file with read_name. However, according to the nanocompore manual, I need sequencing_summary.txt.
Question 2) I know nanopolish index can be run without sequencing summary file but is this essential for nanocompore? Question 3) Also, if I provide it in running nanopolish index, would it make read_name column?
Question 4) I want to reproduce nanocompre data. I saw there are several data (MOLM13, synthetic oligos, yeast) deposited in ENA. Where is the output of those data (m6a site)? I really appreciate your help!