marbl / Winnowmap

Long read / genome alignment software

List of files / multiple files as input #9

Open SergejN opened 3 years ago

SergejN commented 3 years ago

Dear maintainers,

is it possible to add an option to specify a list of input files instead of a single file? I work with the axolotl genome and have quite a few long reads. Therefore, I have two possibilities

However, since the genome is too large, minimap2 has to split the index. Therefore, if I pipe the data, winnowmap ends up mapping the reads only to the first 5 scaffolds, which are included in the first index chunk. The other scaffolds are processed afterwards as well, but by then there is no more data in the pipe. It would be nice to be able to specify multiple input files, all of which can be read multiple times if necessary.

I also tried creating the index first by setting -d scaffolds.mmi, and then running winnowmap, but in this case I get a segmentation fault.

thanks!

cjain7 commented 3 years ago

You can run the mapper as:

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont1.fq.gz ont2.fq.gz ont3.fq.gz  ...

Will this resolve your issue?

cjain7 commented 3 years ago

BTW, you can also tweak the size of the chunk that is processed at a time (assuming you can tolerate more memory usage) using the -I parameter.

See https://lh3.github.io/minimap2/minimap2.html

SergejN commented 3 years ago

> You can run the mapper as:
>
> winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont1.fq.gz ont2.fq.gz ont3.fq.gz  ...
>
> Will this resolve your issue?

In theory, yes, but it's also super inconvenient to specify the names of 137 files on the command line.

> BTW, you can also tweak the size of chunk that is processed at a time (assuming you can tolerate more memory-usage) using -I parameter.
>
> See https://lh3.github.io/minimap2/minimap2.html

Yes, I saw this parameter, but I had the impression that minimap2 cannot process sequences longer than 4G. I now see that this was incorrect: the limit applies only to a single sequence within the dataset, not to the total length of the sequences. I will give it a try and set -I to the whole genome size (32Gb). Thanks!

jelber2 commented 3 years ago

You might be able to do

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(ls -1 *.fq.gz | tr '\n' ' ')

Not tested. This assumes that all FASTQ files in the directory are wanted and that they have the extension .fq.gz.

SergejN commented 3 years ago

Yes, sure. This will also work, unless you have to specify so many files that the command line becomes too long (2MB on my system, so quite a few file names):

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(find . -name "*.fq.gz" | grep -v 'whatever_you_want_to_exclude' | tr '\n' ' ')

But I wanted to propose a more elegant way. Of course, I can also put the file names into a text file and then run (assuming there are no spaces or other weird characters)

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(cat filelist | tr '\n' ' ')
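
If the file names may contain spaces, the list can be read line by line instead of going through `$(cat ...)`. Below is a minimal sketch, assuming a POSIX shell and a list file with one path per line. `run_with_list` is a hypothetical helper written for illustration, not part of winnowmap. The system's command-line length limit still applies, because the paths end up as ordinary arguments.

```shell
# run_with_list LIST CMD [ARGS...]
# Hypothetical helper: runs CMD ARGS followed by every non-empty line of
# LIST as one separate argument, so paths with spaces stay intact.
run_with_list() {
    list=$1
    shift
    # Append each path to the positional parameters of this function.
    # The "|| [ -n ... ]" part keeps a last line without a trailing newline.
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] && set -- "$@" "$path"
    done < "$list"
    # "$@" now holds the command, its options, and all listed paths.
    "$@"
}

# Intended use (not run here; "filelist" is the list file from the comment above):
#   run_with_list filelist winnowmap -W repetitive_k15.txt -ax map-ont ref.fa > out.sam
```

Keeping the command itself as an argument of the helper means the same function works for a dry run, for example `run_with_list filelist echo winnowmap ...`, to inspect the final command line before starting a long mapping job.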