How to run Epinano_Variants.py if the reference genome is large ?

SEVEN-XYCHEN commented 1 year ago

Hello, I saw that you mentioned "Note 2: the users should split the computations for each reference sequences if the reference genome is large" in the manual. Now I happen to have encountered this problem. The ref.fa file is 1.8G, and my program did not report an error but stopped running. How should I solve it? Thank you very much for taking the time to answer my question! Looking forward to your reply.

enovoa commented 1 year ago

Hi @SEVEN-XYCHEN, if the program stopped running likely due to memory issues, please relaunch the job as stated in Note2, i.e. splitting the computations for each reference sequence (e.g. subset of transcripts, of by chromosomes). You may wish to first just run one chromosome (e.g. reads mapped to chr21) to see if memory was the issue. Best, Eva

SEVEN-XYCHEN commented 1 year ago

Hi @enovoa , Thank you for your help! I relaunch the job based on your suggestion (e.g. reads mapped to chr1), and it ran successfully. But now I have new ideas about this issue and I hope to receive your help. I ran the following code on chr1:

ref=transcriptome_chr1.fa
fq=pass.fq.gz
minimap2 -t 16 -ax map-ont -uf -k14 --MD $ref $fq | samtools view -Sb - > chr1.bam
samtools sort -@ 8 -m 4G -o chr1.sorted.bam chr1.bam

python Epinano_Variants.py -R $ref -b chr1.sorted.bam -s misc/sam2tsv.jar -n 8 -T t

Then, I got the chr1.sorted.plus_strand.per.site.csv file. The first question is that I should run Epinano_Variants.py on each chromosome, obtain multiple *.plus_strand.per.site.csv files, merge them into one csv file (e.g. merge.csv), and then run the next step on the merge.csv file (e.g. python misc/Slide_Variants.py merge.csv 5), then I only get a 5mer.csv file. Is there still a memory issue like this? Alternatively, run python misc/Slide_Variants.py *.plus_strand.per.site.csv 5 on the csv files of each chromosome and merge the *.5mer.csv files. Which method is more reasonable? The second question is, when I use minimap2 for mapping, should I only map to one chromosome to get the bam file, as I provided the example of chr1 above, or should I map to the whole transcriptome to get the bam file, and then conduct subsequent analysis?

Thanks again, Chen

enovoa commented 1 year ago

Hi Chen, for your first question, both options should in principle work. About your second question, Epinano_variants scripts requires a BAM. This is what you should split before running Epinano_variants.py. See usage below:

python Epinano_Variants.py -h
usage: Epinano_Variants.py [-h] [-R REFERENCE] [-b BAM] [-s SAM2TSV]
                           [-n NUMBER_CPUS] [-T TYPE]

optional arguments:
  -h, --help            show this help message and exit
  -n NUMBER_CPUS, --number_cpus NUMBER_CPUS
                        number of CPUs
  -T TYPE, --type TYPE  reference types, which is either g(enome) or
                        t(ranscriptome);

Required Arguments:
  -R REFERENCE, --reference REFERENCE
                        reference file indexed with samtools faidx and with
                        sequence dictionary created using picard
                        CreateSequenceDictionary
  -b BAM, --bam BAM     bam file; if given; no need to offer reads file;
                        mapping will be skipped
  -s SAM2TSV, --sam2tsv SAM2TSV
                        /path/to/sam2tsv.jar; needed unless a sam2tsv.jar
                        produced file is already given

Thanks Eva

SEVEN-XYCHEN commented 1 year ago

Hi @enovoa , Thank you. I have already understood the first question.The second question may be that I didn't express it clearly. What I want to know is the way I get the BAM file. The first way is to map it to the reference transcriptome of chromosome 1, so I split the reference transcriptome file (e.g. transcriptome_chr1.fa), and later I use the split reference transcriptome (e.g. python Epinano_Variants.py -R $ref). The second way is to map it to the complete reference transcriptome, and then extract the bam file mapped to chromosome 1, so I split the bam file, and later use the complete reference transcriptome. As shown in the following code: The first method:

ref=transcriptome_chr1.fa
fq=pass.fq.gz
minimap2 -t 16 -ax map-ont -uf -k14 --MD $ref $fq | samtools view -Sb - > chr1.bam
python Epinano_Variants.py -R $ref -b chr1.bam -s misc/sam2tsv.jar -n 8 -T t

The second method:

ref=transcriptome.fa
fq=pass.fq.gz
minimap2 -t 16 -ax map-ont -uf -k14 --MD $ref $fq | samtools view -Sb - > all.bam
samtools view -hb all.bam chr1 > chr1.bam

python Epinano_Variants.py -R $ref -b chr1.bam -s misc/sam2tsv.jar -n 8 -T t

I'm not sure what the "split" you mentioned in NOTE 2 specifically refers to? Because I found that the two methods I used above have different results. Thanks, Chen

enovoa commented 1 year ago

I would not map reads to separate fasta files but rather split the bam per chromosome (so option 2), but this is unrelated to EpiNano itself, it is just my own reasoning, because I would think that the effects of mapping to partial genomes/transcriptomes that will lead to multimapping issues, i.e. the "correct" mapping location for a given read is not now in your reference if you map to partial references, so a given read may find a suboptimal mapping option .

SEVEN-XYCHEN commented 1 year ago

My thoughts are also consistent with yours. Thank you very much for your help! Best, Chen

enovoa commented 1 year ago

There is now a slimmer version of EpiNano_Variants in release 1.2.2 and later. Which EpiNano version were you using? Can you please give it a go to this updated version? Thanks

novoalab / EpiNano

How to run Epinano_Variants.py if the reference genome is large ? #135