novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0

Epinano Variants frozen. Not sure why. #101

Closed: niradsp closed this issue 2 years ago

niradsp commented 2 years ago

Here are my commands. I am running it two different ways. First, using the reference genome:

minimap2 --MD -t 50 -ax splice -k14 -uf /home/banskotan2/Reference/grch38/Homo_sapiens_HG38_GRCH38_104.fa combined_3.1.fastq.gz | samtools view -hbS -F 3844 - |samtools sort -@ 50 - combined_3.1.bam

So I am mapping using the -ax splice command, and mapping to the reference genome. The minimap2 version I am using is 2.14-r883.

Next, I ran the variants program as follows:

python /home/banskotan2/Tools/EpiNano/Epinano_Variants.py -n 10 -R /home/banskotan2/Reference/grch38/Homo_sapiens_HG38_GRCH38_104.fa -b /home/banskotan2/Projects/direct_rna_test/combined_3.1.bam.bam -s /home/banskotan2/Tools/EpiNano/misc/sam2tsv.jar --type g

The program appears to be frozen.
Looking at the temporary directory, it seems to get frozen around chunks 230 to 250 for some reason. That last chunk is chromosome 19. It seems to me that for some reason it is not able to parse chromosomes starting with the number 2.

I am also running it differently by mapping it to a reference transcriptome:

minimap2 --MD -t 50 -ax map-ont /home/banskotan2/Reference/grch38/Homo_sapiens.GRCh38.cdna.all.fa combined_3.1.fastq.gz | samtools view -hbS -F 3844 - |samtools sort -@ 50 - combined_3.1_transcriptome.bam

And the variants command:

python /home/banskotan2/Tools/EpiNano/Epinano_Variants.py -n 10 -R /home/banskotan2/Reference/grch38/Homo_sapiens.GRCh38.cdna.all.fa -b /home/banskotan2/Projects/direct_rna_test/combined_3.1_transcriptome.bam.bam -s /home/banskotan2/Tools/EpiNano/misc/sam2tsv.jar --type t

The command above seems to be running as far as I can tell. It has been running for 16 hours or so. Does it take this long? The generated CSV file is still at 0 bytes, but I notice that the program is using 5% of the memory (35 GB).
So, I think the transcriptome epinano_variants is working, but not the genome epinano_variants.

Also, I tried it on the test data. It worked just fine.
So any ideas why it is frozen when I run with --type g?

Thank you in advance.

Huanle commented 2 years ago

Hi @niradsp ,

Can you share your BAM and reference files with me so that I can look into the problem? In the meantime, can you split your BAM file by reference ID and give it another go? Thanks.

niradsp commented 2 years ago

Hello @Huanle, Yes, I can send you the bam and reference files. How do I send them to you? The BAM file is 2.5 GB.

Also, I must mention that it actually worked when I split up the genome file by Chromosome. So I am not sure why it got stuck when I ran it on the entire data.

Also, I have another question. How should I split up the BAM file if I mapped to a reference transcriptome? When I mapped to a genome, I simply split it by chromosome. In case of the transcriptome, I need to split it by the ENST id. There are hundreds of thousands of these.
Any suggestions?

For the transcriptome, the Variants command worked, but slide_variants.py is taking a long time. The file generated by Epinano_Variants was 6 GB. It appears the slide_variants output will be about 5 times bigger. It was at 22 GB yesterday. Today it has only reached 25 GB, so it appears to have slowed down drastically.

Thanks, Nirad

Huanle commented 2 years ago

Hi @niradsp , you can share your BAM file with me using Google Drive or even https://www.filemail.com/ though I am not sure if the latter is safe enough.

Have you run into a similar issue when processing a BAM file generated with the transcriptome reference? If you have to split the BAM by reference ID, you can keep multiple references in one file. This way you avoid having too many small BAM files while still reducing the size of the input.
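One way to decide which references go into which file is to balance the groups by mapped read count. The sketch below is only an illustration, not part of EpiNano: it assumes you have saved the text output of `samtools idxstats` for your BAM, and the function names `parse_idxstats` and `group_references` are made up for this example. It assigns each reference, largest first, to the group that currently holds the fewest reads.

```python
import heapq
from typing import Dict, List


def parse_idxstats(text: str) -> Dict[str, int]:
    """Parse `samtools idxstats` text into {reference: mapped_read_count}.

    Each line is: name <TAB> length <TAB> mapped <TAB> unmapped.
    References without mapped reads and the "*" (unplaced) line are skipped.
    """
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        if name != "*" and int(mapped) > 0:
            counts[name] = int(mapped)
    return counts


def group_references(counts: Dict[str, int], n_groups: int) -> List[List[str]]:
    """Greedy balancing: biggest reference first, into the lightest group."""
    heap = [(0, i) for i in range(n_groups)]  # (reads assigned so far, group index)
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(n_groups)]
    for name, n_reads in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, i = heapq.heappop(heap)
        groups[i].append(name)
        heapq.heappush(heap, (load + n_reads, i))
    return groups
```

With a few hundred thousand transcript IDs, a few dozen groups would keep the number of BAM files manageable; the right number depends on your data and memory, so treat it as something to tune.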

As for the slowness of slide_variants, you can divide the variants file into smaller ones and run the slide_variants script on them in parallel.
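A rough sketch of dividing the variants file is shown here. It rests on assumptions you should check against your own file: that the first line is a header, that the first comma-separated column is the reference ID, and that all rows of one reference are contiguous. The function name `split_variants_csv` and the `chunk_NNNN.csv` naming are invented for this example. It never splits a reference across two chunks, so windows that slide along one reference are not cut at a chunk boundary.

```python
from itertools import groupby
from pathlib import Path
from typing import List


def split_variants_csv(csv_path: str, out_dir: str, max_rows: int = 1_000_000) -> List[Path]:
    """Split a per-site CSV into chunks of roughly max_rows rows.

    Assumes (not verified here): line 1 is a header, column 1 is the
    reference ID, and rows of the same reference are adjacent.
    A reference larger than max_rows gets a chunk of its own.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle = None
    rows_in_chunk = 0
    try:
        with open(csv_path) as fh:
            header = fh.readline()
            # Group consecutive lines that share the same first column.
            for _ref, lines in groupby(fh, key=lambda line: line.split(",", 1)[0]):
                block = list(lines)
                if handle is None or rows_in_chunk + len(block) > max_rows:
                    if handle is not None:
                        handle.close()
                    path = out / f"chunk_{len(paths):04d}.csv"
                    handle = open(path, "w")
                    handle.write(header)  # every chunk keeps the header
                    paths.append(path)
                    rows_in_chunk = 0
                handle.writelines(block)
                rows_in_chunk += len(block)
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Each resulting chunk can then be given to its own slide_variants run, and the outputs concatenated afterwards (dropping the repeated headers).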

Regarding the large size of the CSV files, I recommend converting them to Parquet files. I will add an extra component to enable this.

niradsp commented 2 years ago

Hello @Huanle

It looks like the best way to share the BAM file is through Box. Is there an email address that you can provide? I cannot share it publicly.

When processing transcriptome reference, it actually went smoothly. The only problem is that it took a long time due to file size, and as I mentioned, I wasn't sure about how to split it.

And the genome processing went smoothly if I split the file. If I don't split it, it is frozen. The server has lots of memory, so that is not a problem. Something else is going on.

Thanks, Nirad

Huanle commented 2 years ago

Hi @niradsp ,

You can send it through Box to elzedliu@gmail.com. I recommend doing the splitting with pysam. I will write a snippet once I can find a gap in my schedule.
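Until that snippet exists, here is a minimal sketch of what a pysam-based split could look like. It is not the author's code. It assumes pysam is installed and that `groups` is a list of lists of reference names (for example one list per chromosome, or batches of transcript IDs); the names `invert_groups`, `part_paths` and `split_bam` and the `.partNNN.bam` suffix are hypothetical. It makes a single pass over the BAM, so no index is needed, and reads keep their original order within each output file.

```python
from typing import Dict, List


def invert_groups(groups: List[List[str]]) -> Dict[str, int]:
    """Map each reference name to the index of the group it belongs to."""
    return {ref: i for i, refs in enumerate(groups) for ref in refs}


def part_paths(out_prefix: str, n_parts: int) -> List[str]:
    """Output file names, one per group (naming scheme is an assumption)."""
    return [f"{out_prefix}.part{i:03d}.bam" for i in range(n_parts)]


def split_bam(bam_path: str, groups: List[List[str]], out_prefix: str) -> List[str]:
    """Write one BAM per group of references in a single pass over bam_path."""
    # Third-party dependency, imported here so the helpers above work without it.
    import pysam

    ref_to_group = invert_groups(groups)
    paths = part_paths(out_prefix, len(groups))
    with pysam.AlignmentFile(bam_path, "rb") as src:
        # template=src copies the header, so every part lists all references.
        outs = [pysam.AlignmentFile(p, "wb", template=src) for p in paths]
        try:
            for read in src.fetch(until_eof=True):
                i = ref_to_group.get(read.reference_name)
                if i is not None:  # skip unmapped reads and unlisted references
                    outs[i].write(read)
        finally:
            for out in outs:
                out.close()
    return paths
```

For example, `split_bam("combined_3.1.bam.bam", [["1", "2"], ["19"]], "combined_3.1")` would write two files, assuming the reference names in your BAM header are spelled that way. If a downstream step needs an index, run `samtools index` on each part afterwards.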

Thanks for using epinano, I will definitely work to improve it.
Cheers - Huanle

niradsp commented 2 years ago

Thank you @Huanle. I sent you an email. Please let me know if it works for you.

Thanks, Nirad

lvclark commented 2 years ago

What was the solution? I am having a similar problem with Epinano_Variants.py hanging. The really perplexing part is that exactly where it hangs is different each time (generates a different number of .freq files). It worked on six of my samples but not the other six. Here is my script. I can probably share files by Box if needed. I am using the Docker image.

#!/bin/bash
#SBATCH -n 2
#SBATCH --mem=32G
#SBATCH -p hpcbio
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=lvclark@illinois.edu
#SBATCH -J epinano
#SBATCH --array=5-7,9-11
#SBATCH -D /home/groups/hpcbio/projects/redacted/redacted-oxfordRNA-2021Jun/src/slurm-out

# Script to retry epinano jobs that hung

cd /home/groups/hpcbio/projects/redacted/redacted-oxfordRNA-2021Jun/

Sample_ID=`head src/sample_list_May2022.txt -n $SLURM_ARRAY_TASK_ID | tail -n 1 | cut -f 1`
Cross=`head src/sample_list_May2022.txt -n $SLURM_ARRAY_TASK_ID | tail -n 1 | cut -f 4`

Guppy=results/guppy/2022-05-17_for_EpiNano
Date=2022-07-06
FlairDate=2022-06-14

Transcriptome=results/parent_transcriptomes/${FlairDate}/Cross${Cross}_transcriptome.fa

### Strict mode: http://redsymbol.net/articles/unofficial-bash-strict-mode/
set -euo pipefail
IFS=$'\n\t'

### Modules
module purge
module load singularity/3.8.1

### Run Epinano
mkdir -p results/epinano/${Date}

singularity exec -e epi12_latest.sif python3 /usr/local/bin/EpiNano/Epinano_Variants.py \
  -R $Transcriptome --type t \
  -b results/minimap/${Date}/${Sample_ID}_all_sorted.bam \
  -s /usr/local/bin/EpiNano/misc/sam2tsv.jar -n $SLURM_NTASKS

lvclark commented 2 years ago

I also tried Epinano_Variants.dev.py (using a local copy, but executing it from the container) but got a "file not found" error with it.