lbcb-sci / herro

HERRO is a highly-accurate, haplotype-aware, deep-learning tool for error correction of Nanopore R10.4.1 or R9.4.1 reads (read length of >= 10 kbps is recommended).

conda environment installation issues #65

Open sjfleck opened 1 day ago

sjfleck commented 1 day ago

Hello, I want to use HERRO for its experimental R9.4.1 correction model, but I'm having issues running this command to set up the conda environment:

conda env create --file scripts/herro-env.yml

I get an error saying that some dependencies are not available from conda-forge and bioconda. I went through the dependencies and version numbers listed in the .yml file and found many of the missing ones in the intel, anaconda, cctbx202211, and popiclab-dev channels. I assume that adding these to the "channels:" section would fix that issue.

I updated it in the file by adding (please ignore the github format change):

channels:

That said, there are still some dependencies that are no longer available with their listed version. These are:

I'm new to running Singularity images, but if I understand correctly, I will need this conda environment active when I run this command:

singularity run --nv --bind : herro.sif inference

Do you have an updated herro-env.yml file? If not, is it advisable for me to change the unavailable dependency versions to available ones (even though these may not have been tested) and try to finish installing the conda environment?

Thank you for any help you're able to provide on this.
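One way to try the second option is to relax the pins rather than edit each version by hand. The following is a minimal sketch, assuming the dependency lines in herro-env.yml use the usual `- name=version=build` form written by `conda env export`. The function name `relax_pins` and the output file name are hypothetical, and an environment built this way would not have been tested with HERRO.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of HERRO): drop the build string from conda
# pins read on stdin, turning "  - name=version=build" into "  - name=version".
# Lines without a build string (channel names, "name:", pip "==" pins)
# pass through unchanged.
relax_pins() {
  sed -E 's/^([[:space:]]*-[[:space:]]+[^=[:space:]]+=[^=[:space:]]+)=[^=[:space:]]+[[:space:]]*$/\1/'
}

# Usage (the output file name is an assumption):
#   relax_pins < scripts/herro-env.yml > herro-env.relaxed.yml
#   conda env create --file herro-env.relaxed.yml
```

Keeping the version but dropping the build string lets the solver pick any available build of the same release, which is often enough when a channel has disappeared. Packages whose exact version is gone would still need to be changed manually.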

sjfleck commented 13 hours ago

This problem is worse than I thought because many of the software versions were only available on the intel channel, which isn't around anymore. That makes many more dependencies unavailable.

I was able to access an old conda environment for herro that someone installed on our system in May and only needed to manually install porechop and duplex_tools using pip in order to do the preprocessing steps without any errors.

When I finally ran herro, this was my command:

singularity run --nv --bind ${host_path}:${dest_path} herro.sif inference \
  --read-alns ${directory_alignment_batches} \
  -t ${feat_gen_threads_per_device} \
  -d 0,1 \
  -m ${model_path} \
  -b ${batch_size} \
  ${output_prefix}.fastq.gz ${fasta_output}

and I ended up with this error:

[W graph_fuser.cpp:108] Warning: operator() profile_node %1243 : int[] = prim::profile_ivalue(%1241) does not have profile information (function operator())
[00:01:22] Processed 205217 reads.

The low number of reads is because I'm using a small input fastq.gz file (5.7 Gb, with no reads shorter than 5 kb, since a 10 kb cutoff eliminated too much) as a test to get herro working. I will note that the original file started with 434,739 reads. The "preprocess_cont.sh" command appeared to use all the reads, and this was its output:

[10:34:19 - SplitOnAdapters] Split 6145 reads Kept 428597 reads
[10:34:19 - SplitOnAdapters] Wrote a total of 12388 reads
[WARN] you may switch on flag -g/--remove-gaps to remove spaces

When I ran "create_batched_alignments.sh", this was the output:

0it [00:00, ?it/s]
[M::mm_idx_gen::89.401*1.61] collected minimizers
[M::mm_idx_gen::90.988*2.16] sorted minimizers
[M::main::90.988*2.16] loaded/built the index for 244684 target sequence(s)
[M::mm_mapopt_update::96.738*2.09] mid_occ = 13
[M::mm_idx_stat] kmer size: 25; skip: 17; is_hpc: 0; #seq: 244684
[M::mm_idx_stat::100.156*2.05] distinct minimizers: 307960901 (75.75% are singletons); average occurrences: 1.720; average spacing: 8.896; total length: 4711813619
...
[M::worker_pipeline::209.951*10.52] mapped 244684 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -K8g -cx ava-ont -k25 -w17 -e200 -r150 -m2500 -z200 -f 0.005 -t64 --dual=yes preprocessed.fastq.gz preprocessed.fastq.gz
[M::main] Real time: 210.432 sec; CPU: 2209.777 sec; Peak RSS: 34.926 GB
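To compare the read counts quoted above across stages (the original input, preprocessed.fastq.gz, and the corrected output), the records can be counted directly. This is a minimal sketch, assuming four-line FASTQ records and an uncompressed FASTA output. The function names `count_fastq_reads` and `count_fasta_reads` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for sanity-checking read counts between stages.

# Count records in a gzipped FASTQ file, assuming every record is
# exactly four lines (no wrapped sequence or quality lines).
count_fastq_reads() {
  gzip -cd -- "$1" | awk 'END { print int(NR / 4) }'
}

# Count records in an uncompressed FASTA file by counting header lines.
count_fasta_reads() {
  grep -c '^>' -- "$1"
}

# Usage:
#   count_fastq_reads preprocessed.fastq.gz
#   count_fasta_reads "${fasta_output}"
```

If the corrected output count matches the "Processed ... reads" line, the run most likely completed despite the warning, although that alone does not confirm the corrections themselves are sound.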

Is the error I got from running herro concerning, or can I ignore it? Any help with this would be greatly appreciated.

Thank you,
Steve