pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

Check order when multiple pairs of FASTQs are supplied #467

Open eturro opened 2 hours ago

eturro commented 2 hours ago

Hi, thanks for producing this great software. I've a suggestion to add a check of user input.

The documentation states "The default running mode is paired-end and requires an even number of FASTQ files represented as pairs, e.g. kallisto quant -i index -o output pairA_1.fastq pairA_2.fastq pairB_1.fastq pairB_2.fastq"

It seems that the order is critical. If the user accidentally supplies the files in the order pairA_1.fastq pairB_1.fastq ... pairA_2.fastq pairB_2.fastq ... , kallisto runs without issuing any errors, but outputs erroneous quantities.

The suggestion is to add a check of the user input.
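
To make the failure mode concrete, here is a minimal Python sketch of positional pairing. It assumes that files are grouped two at a time in the order given, as the documentation's example suggests. It is not kallisto's actual code, and `consecutive_pairs` is a hypothetical helper name.

```python
def consecutive_pairs(files):
    """Group a flat list of FASTQ paths into (first, second) pairs by position.

    Assumption: pairing is purely positional (1st with 2nd, 3rd with 4th, ...).
    """
    if len(files) % 2 != 0:
        raise ValueError("paired-end mode needs an even number of FASTQ files")
    return list(zip(files[0::2], files[1::2]))


intended = ["pairA_1.fastq", "pairA_2.fastq", "pairB_1.fastq", "pairB_2.fastq"]
swapped = ["pairA_1.fastq", "pairB_1.fastq", "pairA_2.fastq", "pairB_2.fastq"]

print(consecutive_pairs(intended))
print(consecutive_pairs(swapped))
```

With the swapped order, `pairA_1.fastq` ends up grouped with `pairB_1.fastq`, so reads that are not mates are treated as a pair without any error being raised.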

Yenaled commented 2 hours ago

I don’t really think there is a way to check for that. FASTQ files are just DNA base sequences, and there’s no way for a program to know what file should be paired with what.

eturro commented 1 hour ago

Can't you just check the read names (at least the first few)? They should match between files in a pair.
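
A read-name check along these lines could be sketched in Python as follows. This is an illustration only, not part of kallisto. It assumes four-line FASTQ records, that the read name is the first whitespace-delimited token of the header, and that mates may differ only by a trailing `/1` or `/2`. The function names are hypothetical.

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def normalize(header):
    """Reduce a FASTQ header line to a read name comparable across mates."""
    tokens = header.strip().split(maxsplit=1)
    name = tokens[0] if tokens else ""
    if name.startswith("@"):
        name = name[1:]
    # Assumption: older-style mate suffixes are exactly "/1" and "/2".
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def first_read_names(path, n=10):
    """Return the normalized names of the first n reads in a FASTQ file."""
    with open_fastq(path) as handle:
        # Assumption: each record is exactly four lines, so headers sit
        # on lines 0, 4, 8, ...
        return [normalize(line) for line in islice(handle, 0, 4 * n, 4)]


def names_match(first, second, n=10):
    """True if the first n read names agree between the two files."""
    return first_read_names(first, n) == first_read_names(second, n)
```

Because only the first few records are read, the check costs almost nothing even on large files. A caller could emit a warning rather than an error when `names_match` returns `False`, which would still allow files whose read names have been altered.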

Yenaled commented 1 hour ago

That would be possible, although I often work with FASTQs with altered read names. Perhaps a warning could be printed if the names don't match. Will consider it.

eturro commented 1 hour ago

Alternatively, you could check that the read counts are the same for the first and second files, the third and fourth files, and so on. But that seems more complicated than simply checking the name of the first read from each file, which would normally give the pattern readA readA readB readB readC readC ...

Actually, I'm surprised the program doesn't fail when the files it treats as a pair don't contain the same number of reads.
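
For comparison, the read-count check could look roughly like the Python sketch below. It again assumes positional pairing and four-line FASTQ records with no blank lines, and the helper names are hypothetical rather than anything kallisto provides.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def mismatched_pairs(files):
    """Return (first, second, n_first, n_second) for each pair whose counts differ.

    Assumption: files are paired by position (1st with 2nd, 3rd with 4th, ...).
    """
    mismatches = []
    for first, second in zip(files[0::2], files[1::2]):
        n_first, n_second = count_reads(first), count_reads(second)
        if n_first != n_second:
            mismatches.append((first, second, n_first, n_second))
    return mismatches
```

Counting requires a full pass over every file, which is why it is the heavier option. Equal counts also do not prove that two files are mates, so this check would complement a read-name comparison rather than replace it.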