Shians opened 5 years ago
I've hacked it myself and added the following function
```cpp
int index_nodb_main(int argc, char** argv)
{
    parse_index_options(argc, argv);

    // Read a map from fast5 files to read name from the sequencing summary files (if any)
    process_summary_fofn();
    std::multimap<std::string, std::string> fast5_to_read_name_map;
    for(const auto& ss_filename : opt::sequencing_summary_files) {
        if(opt::verbose > 2) {
            fprintf(stderr, "summary: %s\n", ss_filename.c_str());
        }
        parse_sequencing_summary(ss_filename, fast5_to_read_name_map);
    }

    // import read names, and possibly fast5 paths, from the fasta/fastq file
    ReadDB read_db;
    read_db.build(opt::reads_file);
    return 0;
}
```
to nanopolish_index.cpp. I don't fully understand the ramifications but it seems to work.
The intended use case for most applications is to cat all the fastq files together, then run `nanopolish index` once. Next week when I'm back at work I'll look into adding an option like `--skip-fast5s` to avoid building the `.readdb` file. Alternatively I could add an option to index a single `.fast5` file. Let me know what would be most useful for you.
Thanks @jts. The reason I'm doing this is that I noticed that even with 16 threads allocated, the cluster reports less than 70% CPU utilisation. This is with new data I received as multi-fast5 files, and I haven't found where the bottleneck is. However, by running jobs separately on smaller files I'm able to run ~100 jobs at a time, which speeds things up significantly: a job I estimated would take over two months completed in less than a day.
I'll need to think a bit more about whether it's worthwhile to properly support my kind of workflow. The situation right now is quite messy:

1. `nanopolish index` once to get a single readdb file, which I split using an R script into a collection of readdb files containing the index relevant to each fastq file.
2. `nanopolish index-nodb` on the rest of the fastq files to get the remaining index files.
3. `nanopolish call-methylation`.

I'm not sure how much of this makes sense to support in nanopolish. From an interface perspective, rather than skipping the fast5 indexing, it would make more sense to index a folder of fastq files, since the work to index the folder of fast5 files has to be done at least once but should not be repeated per fastq file if they came from a common origin folder.
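The per-fastq splitting step described above could be sketched roughly as follows. This is a hypothetical sketch, not nanopolish code: it assumes the `.readdb` format is tab-separated `read_name<TAB>fast5_path` lines, that each fastq header's first whitespace-delimited token after `@` is the read name, and the `split_readdb` function and the `<fastq>.index.readdb` output naming are my own inventions.

```cpp
// Hypothetical sketch: split a master .readdb into per-fastq .readdb files.
// Assumes tab-separated "read_name<TAB>fast5_path" lines in the readdb and
// fastq headers of the form "@read_name ...". Not validated against nanopolish.
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Collect the read names appearing in one fastq file.
static std::unordered_set<std::string> read_names_in_fastq(const std::string& fastq_path)
{
    std::unordered_set<std::string> names;
    std::ifstream in(fastq_path);
    std::string line;
    size_t line_no = 0;
    while(std::getline(in, line)) {
        // fastq records are 4 lines; the header is the first line of each record
        if(line_no % 4 == 0 && !line.empty() && line[0] == '@') {
            std::istringstream iss(line.substr(1));
            std::string name;
            iss >> name; // first token after '@' is the read name
            names.insert(name);
        }
        ++line_no;
    }
    return names;
}

// Write <fastq>.index.readdb containing only the master entries whose
// read name occurs in that fastq file.
void split_readdb(const std::string& master_readdb,
                  const std::vector<std::string>& fastq_files)
{
    // load the master readdb once
    std::unordered_map<std::string, std::string> master;
    std::ifstream in(master_readdb);
    std::string line;
    while(std::getline(in, line)) {
        size_t tab = line.find('\t');
        if(tab != std::string::npos) {
            master[line.substr(0, tab)] = line.substr(tab + 1);
        }
    }

    for(const auto& fq : fastq_files) {
        std::ofstream out(fq + ".index.readdb");
        for(const auto& name : read_names_in_fastq(fq)) {
            auto it = master.find(name);
            if(it != master.end()) {
                out << it->first << '\t' << it->second << '\n';
            }
        }
    }
}
```

The master readdb is loaded into a hash map once, so splitting for many fastq chunks only pays the fast5 scan cost a single time.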
Did you resolve this issue @Shians or come up with an enhanced workflow?
I am in a similar position.
We have over 80 million PromethION reads and around 10TB of raw fast5. If I concatenate all the fastq.gz files I get a gzipped archive of 0.5TB. Indexing that fastq archive alone to produce the `.index`, `.index.fai`, and `.index.gzi` files took a long time. Subsequently scanning across all fast5 files to produce the `.readdb` in a single nanopolish process is very time consuming.
Do you or @jts have any suggestions for speeding up or parallelizing the fast5 indexing?
Can the index job be restarted without re-indexing the fastq.gz archive (i.e. moving on if the index is already there)?
I am using nanopolish to call methylation. As far as I can tell, `nanopolish index` indexes the entire contents of a fast5 folder and a single fastq file. So for multiple fastq files derived from the same folder of fast5 files (as is the case when Guppy produces fastq files containing 4000 reads each), the `index.readdb` will be computed for EACH of the `nanopolish index` calls, even though it contains enough information for all the fastq files.

I have written my own script to split an `index.readdb` into smaller files containing only the reads relevant to specific fastq files, however I cannot run nanopolish without generating the other 3 index files. Is it possible to skip the fast5 indexing step while still generating the remaining indices?