mhalushka / miRge3.0

Comprehensive analysis of small RNA sequencing data
MIT License

Novel miRNA annotation hanging? #61

Glfrey closed this issue 1 year ago

Glfrey commented 1 year ago

Hello again,

I'm running the novel annotation pipeline using the following command:

miRge3.0 -rr -s <file_dir> -lib /miRge3_Lib/ -db mirbase -o mirbase_novel_dir -cpu 6 -on human -ie -ai -mEC -nmir

However, after 2 days the pipeline is still stuck on the following step:

Predicting novel miRNAs

I've run this twice now with the same result. I understand novel miRNA discovery is likely to take some time to compute; I just wondered if this kind of time scale was normal, or if the program is hanging in some way. The amount of memory being used generally stays at around 23 GB for the discovery process for the full 2 days.

arunhpatil commented 1 year ago

Hi @Glfrey,

I had not previously tested the resume function with -nmir. I believe -mEC can't be used with the resume function, since error correction happens before the collapsing of reads. Can you try this command again with just -nmir?

miRge3.0 -rr -s <file_dir> -lib /miRge3_Lib/ -db mirbase -o mirbase_novel_dir -cpu 6 -on human -nmir

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil

Doh! Of course, that would make sense. I'm running it now and I'll let you know what it does. Thank you for replying so rapidly!

Gill

Glfrey commented 1 year ago

Hi @arunhpatil,

I left it overnight and it's still on:

Predicting novel miRNAs

Performing prediction of novel miRNAs...
Start to predict

Is this expected or indicative of an error? The program has been steadily using 21 GB of RAM since the prediction started.

arunhpatil commented 1 year ago

Hi @Glfrey,

This is not expected unless there are too many samples to analyze. I will test a sample on my end and get back to you. I apologize for the inconvenience.

Thank you, Arun.

mhalushka commented 1 year ago

I agree with what Arun wrote. Can you give us a sense of how large the input FASTQ file was? Also, have you run this without novel miRNA detection to confirm that the file can be processed? And if so, can you give us a sense of how large the unmapped.csv file was? That can help inform whether the software is hanging (likely) or overwhelmed (possible). The novel miRNA discovery is slow.

arunhpatil commented 1 year ago

Hi @Glfrey,

I have tried with two samples. The run was successful, please see the commands below.

Can you obtain these two files (SRR772403, SRR772404) from NCBI and try them on your machine? If this fails then we can figure out why.

 miRge3.0 -s SRR772403.fastq.gz,SRR772404.fastq.gz -a illumina -lib /mnt/d/Halushka_lab/Arun/GTF_Repeats_miRge2to3/miRge3_Lib/revised_hsa -on human -db mirbase -o output_dir -spl
bowtie version: 1.3.1
cutadapt version: 4.1
Samtools version: 1.6
Collecting and validating input files...

miRge3.0 will process 2 out of 2 input file(s).

Cutadapt finished for file SRR772403 in 6.4198 second(s)
Collapsing finished for file SRR772403 in 0.0403 second(s)

Cutadapt finished for file SRR772404 in 15.0359 second(s)
Collapsing finished for file SRR772404 in 0.3675 second(s)

Matrix creation finished in 0.1452 second(s)

Data pre-processing completed in 22.4049 second(s)

Alignment in progress ...
Alignment completed in 10.4587 second(s)

Summarizing and tabulating results...
Summary completed in 1.0301 second(s)

The path to ourput directory: /mnt/d/Halushka_lab/Arun/datasets/output_dir/miRge.2023-01-12_20-02-41

The analysis completed in 34.7819 second(s)

Resuming the run to use the -nmir function:

miRge3.0 -s /mnt/d/Halushka_lab/Arun/datasets/output_dir/miRge.2023-01-12_20-02-41  -lib miRge3_Lib -on human -db mirbase -o output_dir -rr -nmir
bowtie version: 1.3.1
cutadapt version: 4.1
Samtools version: 1.6
RNAfold version: 2.4.14
Collecting and validating input files...

miRge3.0 will process 2 saved run(s) from binary pickle file.

Alignment in progress ...
Alignment completed in 11.2452 second(s)

Summarizing and tabulating results...
Summary completed in 0.9763 second(s)

Predicting novel miRNAs

Performing prediction of novel miRNAs...
Start to predict
Prediction of novel miRNAs Completed (63.73 sec)

The path to ourput directory: /mnt/d/Halushka_lab/Arun/datasets/output_dir/miRge.2023-01-12_20-05-10

The analysis completed in 79.5234 second(s)

arunhpatil commented 1 year ago

Hi @Glfrey,

The earlier samples didn't yield novel miRNAs, maybe because they were small. I tried two other samples, and they yielded a few novel miRNAs. You may try them instead.

miRge3.0 -s SRR9856179.fastq.gz -a AGATCGGAAGAGCACACGTCTGAACTCC -lib /mnt/d/Halushka_lab/Arun/GTF_Repeats_miRge2to3/miRge3_Lib/revised_hsa -on human -db mirbase -o output_dir -spl

and

miRge3.0 -s SRR8487219.fastq.gz -a illumina -lib /mnt/d/Halushka_lab/Arun/GTF_Repeats_miRge2to3/miRge3_Lib/revised_hsa -on human -db mirbase -o output_dir -spl

Thank you, Arun

Glfrey commented 1 year ago

Hi @arunhpatil and @mhalushka,

Thank you for your help so far. I have run the pipeline without novel miRNA detection and it runs without fault and very quickly. It could be that my data is too large, as the files range from 9-20 GB (uncompressed) and there are 18 files in total. The unmapped.csv is 1.5 GB.

Interestingly running the pipeline with error correction and novel miRNA detection induces a fault:

Command:

miRge3.0 -s /Users/gillianreynolds/Documents/smRNA2/raw2 -lib /Users/gillianreynolds/Documents/smRNA2/mirge/miRge3_Lib/ -db mirbase -o mirbase_novel_dir -cpu 6 -on human -nmir -mEC

Fault:

Traceback (most recent call last):
  File "/Users/gillianreynolds/Library/Python/3.7/bin/miRge3.0", line 10, in <module>
    sys.exit(main())
  File "/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/__main__.py", line 138, in main
    pdDataFrame,sampleReadCounts,trimmedReadCounts,trimmedReadCountsUnique = bakingEC(args, fastq_fullPath, base_names, workDir)
  File "/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/digestEC.py", line 236, in bakingEC
    run_merEC(args, kmc_exe, kmc_dump_exe, miREC_fq_exe, i)
  File "/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/digestEC.py", line 107, in run_merEC
    kmc_EC = subprocess.run(str(kmcExec), shell=True, check=True, stdout=subprocess.PIPE, text=True, stderr=subprocess.PIPE, universal_newlines=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc -k15 -fq -ci1 ./correct_read.fastq tmp15 ./' returned non-zero exit status 126.

I'll run the other files with the commands you supplied and I'll let you know what happens.

arunhpatil commented 1 year ago

Hi @Glfrey,

The sample size of 18 could be the reason it is slow. With regard to -mEC and -nmir, this combination of parameters shouldn't cause errors. If you have run with -mEC in the past and the software stopped or was forcefully stopped, intermediate files will have been left behind in the execution directory. When you start miRge3.0 afresh with -mEC, those intermediate files cause these problems. In short, if you find any of the files listed below, please delete them and re-run miRge3.0 with both the -mEC and -nmir options.

correct_read.fastq, input.fq, ID_read_quality_cor.txt, expreLevel_cor.txt, ID_read_quality_input.txt, id_read.txt, *.freq, changed_detail.txt, changed_list.txt, and tmp*
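The cleanup described above can be sketched in Python. This is a minimal sketch, assuming the leftover files sit directly in the directory where miRge3.0 was launched; the helper names `find_mec_leftovers` and `remove_mec_leftovers` are hypothetical, and the patterns are copied from the list above. Because `tmp*` can match unrelated files, the sketch defaults to a dry run and never deletes directories.

```python
from pathlib import Path

# Patterns copied from the list in this comment; nothing else is assumed
# about what the error-correction step writes.
LEFTOVER_PATTERNS = [
    "correct_read.fastq", "input.fq", "ID_read_quality_cor.txt",
    "expreLevel_cor.txt", "ID_read_quality_input.txt", "id_read.txt",
    "*.freq", "changed_detail.txt", "changed_list.txt", "tmp*",
]


def find_mec_leftovers(run_dir="."):
    """Return sorted paths in run_dir (non-recursive) matching the patterns."""
    root = Path(run_dir)
    found = set()
    for pattern in LEFTOVER_PATTERNS:
        found.update(root.glob(pattern))
    return sorted(found)


def remove_mec_leftovers(run_dir=".", dry_run=True):
    """Return the leftovers; delete plain files only when dry_run is False."""
    leftovers = find_mec_leftovers(run_dir)
    for path in leftovers:
        if not dry_run and path.is_file():
            path.unlink()
    return leftovers
```

Calling `remove_mec_leftovers(".")` first and reading the returned list before repeating the call with `dry_run=False` keeps the deletion step deliberate.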

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

I downloaded the two files you provided in your second response (SRR9856179 and SRR8487219) and mirge still hangs with no progress for the novel miRNA detection. I used the following commands:

miRge3.0 -s SRR9856179.fastq -a AGATCGGAAGAGCACACGTCTGAACTCC -lib /Users/gillianreynolds/Documents/smRNA2/mirge/miRge3_Lib/ -on human -db mirbase -o output_dir -spl

miRge3.0 -s /Users/gillianreynolds/Documents/test/output_dir/miRge.2023-01-14_18-35-17 -lib /Users/gillianreynolds/Documents/smRNA2/mirge/miRge3_Lib/ -on human -db mirbase -o output_dir -rr -nmir

and received this output:

miRge3.0 -s /Users/gillianreynolds/Documents/test/output_dir/miRge.2023-01-14_18-35-17 -lib /Users/gillianreynolds/Documents/smRNA2/mirge/miRge3_Lib/  -on human -db mirbase -o output_dir -rr -nmir
bowtie version: 1.3.0
cutadapt version: 4.2
Samtools version: 1.16.1
RNAfold version: 2.1.9
Collecting and validating input files...

miRge3.0 will process 1 saved run(s) from binary pickle file.

Alignment in progress ...
Alignment completed in 52.7066 second(s)

Summarizing and tabulating results...
Summary completed in 4.8962 second(s)

Predicting novel miRNAs

Performing prediction of novel miRNAs...
Start to predict

I also tried to predict without resuming which also hung with the same memory usage as above.

miRge3.0 -s SRR9856179.fastq -a AGATCGGAAGAGCACACGTCTGAACTCC -lib /Users/gillianreynolds/Documents/smRNA2/mirge/miRge3_Lib/ -on human -db mirbase -o output_dir_combined -spl -nmir
bowtie version: 1.3.0
cutadapt version: 4.2
Samtools version: 1.16.1
RNAfold version: 2.1.9
Collecting and validating input files...

miRge3.0 will process 1 out of 1 input file(s).

Cutadapt finished for file SRR9856179 in 65.7478 second(s)
Collapsing finished for file SRR9856179 in 1.2663 second(s)

Matrix creation finished in 1.2057 second(s)

Data pre-processing completed in 69.5889 second(s)

Alignment in progress ...
Alignment completed in 55.3385 second(s)

Summarizing and tabulating results...
Summary completed in 5.1632 second(s)

Predicting novel miRNAs

Performing prediction of novel miRNAs...
Start to predict

There is no progress past this point and the memory usage is stuck at around 9GB.

I can think of a few potential problems which may help with your troubleshooting:

  1. I'm running a Mac with an M1 chip
  2. I had to install RNAfold via conda.

I'm aware conda now supports M1 chips but I also know that the authors of RNAfold have had issues with the M1 Macs:

"At the moment, our pre-compiled installer doesn't work for recent models based on Apples M1 Chip. Users who want to install the ViennaRNA Package on their Mac computer with Apple silicon M1 chip are advised to compile and install from source code." from https://www.tbi.univie.ac.at/RNA/#download

However, I couldn't install from source due to various errors, so I went with conda. I'm going to try and see if I can solve those errors to rule out it being a conda-related RNAfold problem and I'll let you know how I get on.

arunhpatil commented 1 year ago

Hi @Glfrey,

I am sorry. I ran into problems on Mac in the past when I was testing miRge3.0. For good security reasons, macOS takes a different approach from Windows/Linux, which sometimes needs workarounds. I am not sure whether this is due to the chip, so I want to test a few things to confirm.

I have saved the temporary files that are passed to RNAfold and uploaded them as a zip folder here (unmapped_tmp.zip). miRge runs these commands in the backend; if you can run them directly and they don't hang, that will give us an idea of where the problem is.

There are many functions behind that one parameter, so I want to try this first. It may help us debug further. (The attached files belong to SRA: SRR9856179)

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

Unfortunately it's hanging with that command and nothing is being printed to "SRR9856179_precursor_tmp_Glfrey.str". Interestingly, if I activate RNAfold and input the read data directly, it works great:

(mirge_packages) gillianreynolds@Gillians-MacBook-Pro unmapped_tmp % RNAfold

Input string (upper or lower case); @ to quit
....,....1....,....2....,....3....,....4....,....5....,....6....,....7....,....8
UUUUUUGUGAAUUCUUCGAUAAUGGCCCAUUUGGGCAAAAAGCCGGUUAGCGGGGGCAGGCCUCCUAGGGAGAGGAGGGUGGAUGGAAUUAAGGGUGUUAGUCAUGU
length = 107
UUUUUUGUGAAUUCUUCGAUAAUGGCCCAUUUGGGCAAAAAGCCGGUUAGCGGGGGCAGGCCUCCUAGGGAGAGGAGGGUGGAUGGAAUUAAGGGUGUUAGUCAUGU
.........((((((((.((....((((....)))).....(((..........)))...((((((......)))))))).)).)))))).................
 minimum free energy = -26.00 kcal/mol

I don't know if this suggests a problem reading in or opening the file?
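One way to tell a hang from a slow run is to bound the wait with a timeout. The following is a minimal sketch, not the command miRge3.0 runs internally (that command is not shown in this thread); `run_with_timeout` is a hypothetical helper that feeds a file to a program on standard input and reports whether it finished in time.

```python
import subprocess


def run_with_timeout(cmd, input_path=None, timeout=60):
    """Run cmd, optionally feeding input_path on stdin.

    Returns (finished, stdout_text). finished is False when the timeout
    expired, in which case the child process has been killed.
    """
    stdin = open(input_path, "rb") if input_path else subprocess.DEVNULL
    try:
        result = subprocess.run(cmd, stdin=stdin, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, timeout=timeout)
        return True, result.stdout.decode(errors="replace")
    except subprocess.TimeoutExpired:
        return False, ""
    finally:
        if input_path:
            stdin.close()
```

For example, `run_with_timeout(["RNAfold"], input_path="precursors.fa", timeout=120)` would pass a file to RNAfold the same way the interactive session above passes a typed sequence; `precursors.fa` is a placeholder name, not a file from the uploaded archive.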

arunhpatil commented 1 year ago

Hi @Glfrey,

I hope this is not a new way of invoking RNAfold in recent versions; I couldn't find any mention of it in their change log. I use 2.4.14. Can you create an issue describing the file-reading problem? https://github.com/ViennaRNA/ViennaRNA/issues

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

I can certainly do that, although you might be a few versions ahead of me for RNAfold as mine is 2.1.9. I'll see if I can install the same version as you and try everything again. If that also fails I'll seek their advice and see what they say.

Glfrey commented 1 year ago

@arunhpatil, it worked! I upgraded RNAfold to your version (2.4.14) via conda and novel miRNA prediction worked beautifully. Evidently there's some issue with version 2.1.9 and either my computer architecture (M1) or mirge3.0 in general.
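Since the fix here came down to the RNAfold version, a quick pre-flight check can save a long wait. This is a minimal sketch, assuming `RNAfold --version` prints a line containing a dotted version number; the threshold 2.4.14 is only the version that worked in this thread, not a documented minimum for miRge3.0, and the helper names are hypothetical.

```python
import re
import shutil
import subprocess

# 2.4.14 is simply the version that worked in this thread; it is not a
# documented minimum requirement.
KNOWN_GOOD = (2, 4, 14)


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def rnafold_version(executable="RNAfold"):
    """Return the installed version tuple, or None if not on PATH."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run([executable, "--version"],
                            capture_output=True, text=True, check=False)
    return parse_version(result.stdout or result.stderr)


def is_known_good(version):
    """True when version is at least the one that worked in this thread."""
    return version is not None and version >= KNOWN_GOOD
```

With the two versions mentioned above, `is_known_good(parse_version("2.1.9"))` is false and `is_known_good(parse_version("2.4.14"))` is true.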

I'm just going to run it on my data to make sure there's nothing else going on but that's progress!

Glfrey commented 1 year ago

Hi @arunhpatil,

I got another error running it on my data. I think this is the same error as I posted before but I made sure there were no intermediate results in the directory by creating a new one and running the commands:

miRge3.0 -s /Users/gillianreynolds/Documents/smRNA2/raw2/ -lib /Users/gillianreynolds/Documents/smRNA2/mirge/miRge3_Lib/ -db mirbase -o mirbase_novel_dir -cpu 6 -on human -ie -ai -mEC -nmir
bowtie version: 1.3.1
cutadapt version: 4.2
Samtools version: 1.16.1
RNAfold version: 2.4.14
Collecting and validating input files...

WARNING: File /Users/gillianreynolds/Documents/smRNA2/raw2/.DS_Store is neither fastq or fastq.gz format!
Omitting file /Users/gillianreynolds/Documents/smRNA2/raw2/.DS_Store

WARNING: File /Users/gillianreynolds/Documents/smRNA2/raw2/.DS_Store is neither fastq or fastq.gz format!

miRge3.0 will process 18 out of 19 input file(s).

Traceback (most recent call last):
  File "/Users/gillianreynolds/Library/Python/3.7/bin/miRge3.0", line 10, in <module>
    sys.exit(main())
  File "/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/__main__.py", line 138, in main
    pdDataFrame,sampleReadCounts,trimmedReadCounts,trimmedReadCountsUnique = bakingEC(args, fastq_fullPath, base_names, workDir)
  File "/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/digestEC.py", line 236, in bakingEC
    run_merEC(args, kmc_exe, kmc_dump_exe, miREC_fq_exe, i)
  File "/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/digestEC.py", line 107, in run_merEC
    kmc_EC = subprocess.run(str(kmcExec), shell=True, check=True, stdout=subprocess.PIPE, text=True, stderr=subprocess.PIPE, universal_newlines=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc -k15 -fq -ci1 ./correct_read.fastq tmp15 ./' returned non-zero exit status 126.

arunhpatil commented 1 year ago

Hi @Glfrey,

Glad to know about RNAfold. The miREC function is returning no output. Can you try running this command (from the last line of the error)?

/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc -k15 -fq -ci1 ./correct_read.fastq tmp15 ./

Can you let me know the release version of miRge3.0? You can get it with:

conda list | grep "mirge"
mirge3 0.1.1 pyh7cba7a3_0 bioconda

miREC is the most time-consuming step. Can you try without that option and see if the run completes?

Thank you, Arun

Glfrey commented 1 year ago

Hi @arunhpatil,

Sure, the first command gives:

/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc -k15 -fq -ci1 ./correct_read.fastq tmp15 ./
zsh: exec format error: /Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc

A quick google tells me that error could mean the binary isn't executable for my computational architecture. My guess is it's an M1 problem, but computational hardware is out of my wheelhouse so I could be wrong.
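One way to check this guess is to look at the first bytes of the binary, which identify its executable format. The sketch below is a hedged illustration: it does not assume which format the bundled kmc actually is, and `classify_header` and `describe_binary` are hypothetical helper names. The magic numbers are the standard ones for ELF, Mach-O and PE files.

```python
import platform

# Leading magic bytes of common executable formats.
MAGIC = {
    b"\x7fELF": "ELF (Linux)",
    b"\xcf\xfa\xed\xfe": "Mach-O 64-bit (macOS)",
    b"\xce\xfa\xed\xfe": "Mach-O 32-bit (macOS)",
    # 0xCAFEBABE is also used by Java class files.
    b"\xca\xfe\xba\xbe": "Mach-O universal (macOS)",
    b"MZ": "PE (Windows)",
}


def classify_header(header):
    """Map the first bytes of a file to a coarse executable format name."""
    for magic, name in MAGIC.items():
        if header.startswith(magic):
            return name
    if header.startswith(b"#!"):
        return "script"
    return "unknown"


def describe_binary(path):
    """Read the first four bytes of path and classify them."""
    with open(path, "rb") as handle:
        return classify_header(handle.read(4))


def host_description():
    """Describe the OS and CPU type that Python reports for this machine."""
    return f"{platform.system()} {platform.machine()}"
```

Comparing `describe_binary(path_to_kmc)` with `host_description()` shows whether the file format matches the operating system at all; a mismatch there would be consistent with the "exec format error" above.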

Mirge isn't installed via conda on my system as the dependencies clash with the dependencies for RNAfold, so I can't have both in the same environment. It's installed via pip, which gives the version:

mirge3.0 --version
3.0

I'm running it without miREC now, I'll let you know what happens.

Thank you!

arunhpatil commented 1 year ago

Hi @Glfrey,

Before we reinstall miREC, I want to check the permissions on that executable file. Can you print the output of ls -li /Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc.

If possible, try setting:

chmod 777 /Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/miREC_fq
chmod 777 /Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc*

Then test /Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc -h. If this doesn't print anything, then we have to install that again. This is again OS specific (sorry). If the above doesn't work, then you can install miREC from here.

Ok. That should not be a problem in general, let me know how it goes with novel miRNA prediction. You can use pip list | grep "mirge" on Pip.

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

Sure thing.

/Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc.: No such file or directory

Still got an exec error after chmods:

(mirge_packages) gillianreynolds@Gillians-MacBook-Pro miRge.2023-01-05_19-51-22 % /Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc -h
zsh: exec format error: /Users/gillianreynolds/Library/Python/3.7/lib/python/site-packages/mirge/libs/kmc

Also, for the version:

pip list | grep "mirge"
mirge3 0.1.2

I'll install kmc now and let you know how it goes. The miRNA predictions I ran on my own data failed, but that's because it was killed for running out of memory so I'm running smaller tests now. I'll keep you updated.

Thank you!

Gill

arunhpatil commented 1 year ago

Hi @Glfrey,

Ok. I would recommend running the files in batches of 11 (I tested larger combinations; depending on the computer, the memory chokes). If you can test with just one sample, it will be faster to figure out. Also, if one of the samples has an error, the whole chain may fail.
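The batching suggested above can be sketched as follows. This is a minimal sketch under stated assumptions: `make_batches` and `build_command` are hypothetical helpers, the command uses only flags that already appear in this thread (including the comma-separated `-s` list), and adapter or other options would still need to be added for real data.

```python
def make_batches(fastq_files, batch_size=11):
    """Split a list of FASTQ paths into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    files = sorted(fastq_files)
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


def build_command(batch, lib, out_dir, organism="human", db="mirbase"):
    """Build a miRge3.0 argument list from flags shown in this thread."""
    return ["miRge3.0", "-s", ",".join(batch), "-lib", lib,
            "-on", organism, "-db", db, "-o", out_dir, "-nmir"]


if __name__ == "__main__":
    # Placeholder file names standing in for the 18 samples.
    samples = [f"sample_{i:02d}.fastq" for i in range(1, 19)]
    for number, batch in enumerate(make_batches(samples), start=1):
        print(" ".join(build_command(batch, "/miRge3_Lib/", f"batch_{number}")))
```

With 18 files and the default batch size this produces one batch of 11 and one of 7; a smaller `batch_size` may be needed on machines with less memory.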

If you want to combine the counts or RPMs later, you could use the scripts from miROme. The input for these is the output_dir for miRge3.0, which contains many folders with name and date stamps (e.g. miRge.2023-01-19_11-35-48).

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

Unfortunately installing kmc for an M1 Mac is proving very challenging. As it happens my company is migrating some analysis to AWS for scalability purposes so it may be that I just have to perform kmc + novel miRNA annotation on there.

Your suggestion to split up the files is a good one, thank you. I'll let you know how I get on with using mirge3.0 on AWS.

arunhpatil commented 1 year ago

Hi @Glfrey,

I am sorry about the OS architecture issue. However, this is not limited to Mac. On Windows, I installed Ubuntu v22 on WSL2 (Windows Subsystem for Linux) and had the same issue with miREC yesterday; when I switched to WSL1, it worked. The issue has been raised with the MS team; they asked to change the version of WSL2 because of another (common compiler) issue. I don't know how this will affect Mac in the future.

Please let me know if you need any help with AWS. (I have improved the performance of miRge3.0 and am testing it.) There is a slight improvement in speed and a larger improvement in memory usage. I will incorporate your suggestion on the resume function and make a new release (hopefully this week).

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

I can confirm Mirge3.0 runs successfully for all processes on Ubuntu (although I have a new error on a new Ubuntu server to share with you on another issue) so I'll close the issue now.

Thank you for all of your help.

Best wishes,

Gill