marbl / MHAP

MinHash Alignment Process (MHAP, pronounced MAP): locality-sensitive hashing to detect long-read overlaps and utilities
Apache License 2.0
96 stars 13 forks source link

Sequence ordinal in results is i+1 in range of i 1 to N #14

Closed mathog closed 4 years ago

mathog commented 8 years ago

Version 2.1.1

The ordinals in the first column of MHAP output seem to be consistently 1 too high, assuming they are supposed to represent the query sequence positions in the pacbio input file in the range 1 to N.

Example: make a query file "just_two.fasta" which contains two pacbio sequences, each of which matches somewhere in a single target sequence "target.fasta".

Align the query reads to the target sequence:

/m120317_230406_00121_c100309752550000001523013008061277_s1_p0.fasta >just_two.fasta
/usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -jar ~/src/MHAP/target/mhap-2.1.1.jar \
    -s /tmp/target.fasta -q  just_two.fasta \
    --num-threads 2 2>/dev/null

Results are:

3 1 0.116346 27.000000 0 69 692 1043 1 38 593 6655 2 1 0.120035 31.000000 0 2130 2842 2872 0 0 645 6655

The first two values should have been 2 and 1.

This was initially noticed with much larger numbers, where the pacbio reads were in positions 69233 and 69234 but were numbered as 69234 and 69235.

skoren commented 8 years ago

MHAP always increments the sequence ID when it sees a new fasta line. Since you are inputting two sequences (-s and -q) the count for -q will start at |s|+1. So the target.fast gives sequence ID 1 and the reads in just_two.fast will have IDs 2,3. You can also specify `--store-full-id which will use original sequence IDs.

mathog commented 8 years ago

Ah, I see.

It would be nice if there was a switch to let it keep the sequence ordinals for -s and -q separate. Perhaps --use-file-ordinal, or something like that? Then an N x M comparison would have 1-N in the first column and 1-M in the second, so that either could be used directly to index back into the respective files.

The sequence IDs for the query and target sequences in this case are both very long strings and not much fun to work with.

Thanks.

konstantinberlin commented 8 years ago

I will ponder it, though we already have flag overload and it would make for an ambitious output (is it 3-1 or 1-3).

mathog commented 8 years ago

If not exactly ambiguous now it is at least a little confusing. Does the ordinal assignment change for -s -q vs. -q -s? If not, where in -h does it state that -s goes first? The observed ordinal assignment across files (plural) also contradicts the explicit -h description of that assignment by file (singular) here:

--store-full-id, default = false
  Store full IDs as seen in FASTA file, rather than storing just the sequence position in the file...

although this could just be the result of a typo.

skoren commented 8 years ago

We should clarify the docs but I think it doesn't makes sense to not increment the counters for the -q files to start after -s. There can be multiple -q inputs and so if you re-start the count at each file it would be ambiguous. Also, all reads in the -s input are compared against themselves so again it would be ambiguous. It would be OK in the cases where -q had only a single file or --no-self was specified but I think it would be even more confusing to change how we count sequence IDs in special cases rather than just always incrementing a counter.

For simple cases, you don't need to use -q, you can give all your input reads as -s. That will keep the count as 1-3 and consistent with the sequence order in the -s file.