itmat / CAMPAREE

Configurable And Modular Program Allowing RNA Expression Emulation
GNU General Public License v3.0

BEAGLE error: Exception in thread "main" java.lang.IllegalArgumentException: NaN #7

Closed marzie-rasekh closed 11 months ago

marzie-rasekh commented 11 months ago

When running CAMPAREE, I get an error from BEAGLE.

It looks like Beagle is complaining that "ERROR: there is only one sample". I got this message by using a newer version of Beagle. How can I run CAMPAREE on single-sample RNA-seq data (only one pair of FASTQ files)?

Here are the error messages:

BeagleStep.log (exit code 1):

No genetic map is specified: using 1 cM = 1 Mb

Reference samples:           0
Study samples:               1

Window 1 (chr1:14498-39975388)
Study markers:          36,017

Burnin  iteration 1:           1 second
Burnin  iteration 2:           1 second
Burnin  iteration 3:           1 second
Burnin  iteration 4:           1 second
Burnin  iteration 5:           2 seconds
Exception in thread "main" java.lang.IllegalArgumentException: NaN
    at phase.PhaseData.<init>(PhaseData.java:76)
    at main.MainHelper.lsPhaseSingles(MainHelper.java:94)
    at main.MainHelper.phase(MainHelper.java:72)
    at main.Main.phaseData(Main.java:166)
    at main.Main.main(Main.java:116)

*****STDERR:
None

and BeagleStep.serial.err :


Traceback (most recent call last):
  File "/home/mrasekh/git/BEERS2/CAMPAREE/camparee/beagle.py", line 79, in execute
    beagle_result = subprocess.run(command, shell=True, check=True,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mrasekh/.install/mamba/envs/beers2/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'java -jar ~/git/BEERS2/CAMPAREE/third_party_software/beagle.28Sep18.793.jar gt=/data/camparee/run_1/CAMPAREE/data/all_variants.vcf out=/data/output/camparee/run_1/CAMPAREE/data/beagle seed=1890316565 nthreads=36' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/git/BEERS2/CAMPAREE/camparee/beagle.py", line 231, in <module>
    sys.exit(BeagleStep.main())
             ^^^^^^^^^^^^^^^^^
  File "/home/mrasekh/git/BEERS2/CAMPAREE/camparee/beagle.py", line 188, in main
    beagle_step.execute(beagle_jar_path=args.beagle_jar_path,
  File "/home/mrasekh/git/BEERS2/CAMPAREE/camparee/beagle.py", line 90, in execute
    raise CampareeException(f"\nBeagle process failed. "
camparee.camparee_utils.CampareeException: 
Beagle process failed. For full details see /illumina-isi07/scratch/dragen_team_share2/users/mrasekh/RNA_benchmarking/data/Human/real/giab_HG005/camparee/run_1/CAMPAREE/logs/BeagleStep.log

brainfood commented 11 months ago

Hi and thanks for your interest in BEERS2 and CAMPAREE,

Unfortunately, you can't currently run CAMPAREE with a single sample. As you've found, this is because we include a genetic phasing step that requires at least two samples. I'm in the process of patching CAMPAREE so it can skip the Beagle step. Thanks for your patience.

brainfood commented 11 months ago

I've patched CAMPAREE to skip the phasing step if the user provides only one sample. Would you be willing to download the version of CAMPAREE in the 'develop' branch (commit: 3fd75eaf93e6b736c551937c2535a71606b01e24) and confirm that it works with your data?
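
For readers who want to see the idea behind skipping phasing for a single sample, here is a minimal sketch. It is not the code from the patch: `count_vcf_samples` and `should_run_phasing` are hypothetical names, and the sketch assumes the sample count can be read from the `#CHROM` header line of an uncompressed, tab-delimited VCF such as `all_variants.vcf` (eight fixed columns, then `FORMAT`, then one column per sample). CAMPAREE itself may determine the sample count differently, for example from its config file.

```python
def count_vcf_samples(vcf_path):
    """Return the number of sample columns declared in a VCF header.

    Assumes an uncompressed, tab-delimited VCF (hypothetical helper,
    not part of CAMPAREE).
    """
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#CHROM"):
                columns = line.rstrip("\r\n").split("\t")
                # 8 fixed columns + FORMAT precede the per-sample columns.
                return max(len(columns) - 9, 0)
    raise ValueError(f"No #CHROM header line found in {vcf_path}")


def should_run_phasing(vcf_path, min_samples=2):
    """Phasing is only attempted when at least min_samples are present."""
    return count_vcf_samples(vcf_path) >= min_samples
```

A caller could check `should_run_phasing(path)` before invoking Beagle and fall back to an unphased code path otherwise.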

If it works for you, I'll release the patch to the main branch. Thanks!

marzie-rasekh commented 11 months ago

I ran it. This time it failed at the MoleculeMakerStep with error:

MoleculeMakerStep.serial.err :

Traceback (most recent call last):
  File "/home/mrasekh/git/CAMPAREE/camparee/molecule_maker.py", line 726, in <module>
    sys.exit(MoleculeMakerStep.main())
  File "/home/mrasekh/git/CAMPAREE/camparee/molecule_maker.py", line 718, in main
    molecule_maker.execute(sample=sample,
  File "/home/mrasekh/git/CAMPAREE/camparee/molecule_maker.py", line 443, in execute
    [read_fasta(os.path.join(sample_data_directory,
  File "/home/mrasekh/git/CAMPAREE/camparee/molecule_maker.py", line 443, in <listcomp>
    [read_fasta(os.path.join(sample_data_directory,
  File "/home/mrasekh/git/BEERS_UTILS/beers_utils/read_fasta.py", line 35, in read_fasta
    raise ValueError(f"Invalid characters found in the fasta file {fasta_file}: all must be in ACGTN")
ValueError: Invalid characters found in the fasta file /run_1/CAMPAREE/data/sample1/custom_genome_1.fa: all must be in ACGTN

Would this be because of some R and Y characters in the reference genome?
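
The error message says every character must be in ACGTN, so IUPAC ambiguity codes such as R and Y would be rejected. One way to check a reference for such characters, and to mask them with N, is sketched below. The function names are hypothetical and are not part of CAMPAREE or BEERS_UTILS. The thread does not say whether lowercase (soft-masked) bases are accepted, so the sketch compares case-insensitively and leaves the case of valid bases untouched; that is an assumption.

```python
from collections import Counter

VALID_BASES = frozenset("ACGTN")


def find_invalid_bases(fasta_path):
    """Count sequence characters outside ACGTN (case-insensitive)."""
    counts = Counter()
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                continue  # header lines are not sequence
            counts.update(ch for ch in line.strip().upper()
                          if ch not in VALID_BASES)
    return counts


def mask_invalid_bases(in_path, out_path):
    """Write a copy of the FASTA with non-ACGTN characters replaced by N.

    Returns the number of characters replaced.
    """
    replaced = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(line)
                continue
            seq = line.rstrip("\r\n")
            masked = "".join(ch if ch.upper() in VALID_BASES else "N"
                             for ch in seq)
            replaced += sum(a != b for a, b in zip(seq, masked))
            dst.write(masked + "\n")
    return replaced
```

Running `find_invalid_bases` on the reference first shows which characters are present before deciding whether masking them with N is acceptable for the analysis.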

marzie-rasekh commented 11 months ago

I fixed the reference and reran the pipeline on two samples separately. It took a very long time, even with 36 threads (wherever possible), but the pipeline completed successfully. Thank you.

brainfood commented 11 months ago

Thank you very much for testing the patch, and for your feedback! CAMPAREE is a fairly involved compute, so the runtime isn't too surprising. It's effectively running a full alignment, gene/intron/transcript quantification, and variant calling pipeline on each sample. If you weren't already, running it in a cluster environment tends to speed things up more than adding threads. That's an area for optimization we should explore further.

I'm closing this issue as resolved, but I'm marking down support for non-standard bases as a potential feature to add in future releases.