samtools / htslib

C library for high-throughput sequencing data formats

hts_idx_set_meta #936

Open brentp opened 5 years ago

brentp commented 5 years ago

Perhaps this is just a documentation issue, but I am having problems (memory issues) with hts_idx_set_meta. Here is the source:

int hts_idx_set_meta(hts_idx_t *idx, uint32_t l_meta, uint8_t *meta,
                      int is_copy)
{
    uint8_t *new_meta = meta;
    if (is_copy) {
        size_t l = l_meta;
        if (l > SIZE_MAX - 1) {
            errno = ENOMEM;
            return -1;
        }
        new_meta = malloc(l + 1);
        if (!new_meta) return -1;
        memcpy(new_meta, meta, l);
        // Prevent possible strlen past the end in tbx_index_load2
        new_meta[l] = '\0';
    }
    if (idx->meta) free(idx->meta);
    idx->l_meta = l_meta;
    idx->meta = new_meta;
    return 0;
}

Note that when is_copy == 1, the length of new_meta is l_meta + 1, but l_meta is not updated. I can get around this by passing an extra byte (in addition to the final trailing \0 for the last chromosome in my CSI index), but I think it would be better to document that the *meta passed in should contain the trailing \0 and then make new_meta of length l_meta rather than adding a byte.

valeriuo commented 5 years ago

By convention, the size of meta is l_meta + 1, with the last element set to '\0'. Thus, the length is l_meta, similar to strlen behaviour. However, there might be an issue with hts_idx_tbi_name, which includes the terminator value in the new length.

jkbonfield commented 5 years ago

You can pass meta in at the size you desire, with a consistent l_meta. You do not need to allocate meta to be one byte longer than l_meta with an extra nul byte, as that byte would not be copied by hts_idx_set_meta anyway. The extra byte it uses is allocated and written by hts_idx_set_meta itself. The idx->l_meta variable is also copied from your supplied l_meta, as that is the correct size that should be written to disc.

You may therefore be wondering what the +1 in the above code is for. If we gave it 100 bytes, why does it copy 100 bytes and then add one more nul? This code is shared between CSI and TBI indices. Neither has a particularly pleasing specification, but TBI uses the meta data as a series of nul-terminated reference names. According to the TBI specification that should include a nul on the last name too (thus meta should be e.g. "foo\0bar\0" with an l_meta of 8).
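(A minimal caller-side sketch of that convention; the helper name and the surrounding index setup are hypothetical, only the hts_idx_set_meta call itself is from htslib.)

#include <stdint.h>
#include <htslib/hts.h>

/* Hypothetical helper: attach a TBI-style meta block ("foo\0bar\0",
 * i.e. 8 bytes with the last name nul-terminated) to an existing index.
 * l_meta counts every byte we supply; the extra nul appended when
 * is_copy == 1 is internal to hts_idx_set_meta and never ours to add. */
static int set_two_names(hts_idx_t *idx)
{
    uint8_t meta[] = "foo\0bar";   /* the compiler appends the final '\0': 8 bytes */
    return hts_idx_set_meta(idx, (uint32_t)sizeof(meta), meta, 1);  /* is_copy: library makes its own copy */
}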

However, I assume the +1 is purely defensive coding for the case where hts_idx_set_meta is called with a series of concatenated nul-terminated strings whose last one lacks the terminating nul (e.g. "foo\0bar" with an l_meta of 7). That's invalid TBI format, but elsewhere, in tbx_index_load2, we decode the TBI meta data using strlen, and a missing nul termination byte could lead to crashes, hence the byte was added defensively, just in case.

This is an internal nuance that can safely be ignored. Indeed, you definitely shouldn't be manually adding a '\0' to meta and updating your l_meta before calling hts_idx_set_meta for CSI, as the meta data is a bgzf block (I think - the CSI spec is sadly totally silent on this front).

brentp commented 5 years ago

Thanks for the explanations. I checked samtools/bcftools and htslib and there is no code that uses a nonzero is_copy. When I use that in my code with the expected lengths (including the trailing \0, of course), it errors. I have adjusted the code to simply use is_copy 0 (with all else identical) and that avoids the error. I'm not sure how I would need to adjust the length or the meta when using a copy; when I do something that avoids the memory error, I get an extra, empty chromosome name in the index.

FWIW, here is the (Nim) code in question, which now does not do a copy. AFAICT it's not doing anything odd. https://github.com/brentp/hts-nim/blob/c34614ce1e929e74b79756de8e1a0a7efbdddd90/src/hts/csi.nim#L60-L85

brentp commented 5 years ago

Maybe I misread your prescription. Do I pass in meta without the final trailing '\0', and with the appropriate l_meta?

jkbonfield commented 5 years ago

It looks like you're building CSI indices. I was wrong earlier in thinking CSI meta is a bgzf block - it's written in htslib with bgzf_write, but only because the rest of the index is also bgzf. So my previous comment was possibly wrong on that point.

I wish the CSI spec actually contained an explanation of what the format really is. If I recall correctly, there are also hybrid indices lurking somewhere - either a CSI index with a TBI meta-data block or vice versa! Possibly that was BAM CSI vs VCF CSI or something? (Thankfully!) I forget now, but I got thoroughly lost while doing the auto-indexing changes.

You're in a twisty maze of indices all alike!

I'll unpick the code again and try to figure out what the CSI format is.

PS. I wondered if anyone was working on Nim bindings. Good to see. :-)

jkbonfield commented 5 years ago

Ok so it's making a bit more sense. I believe this is how it works.

On BCF and BAM, meta is blank in CSI. On VCF, CSI meta is a partial TABIX header struct (missing the magic and n_ref fields). This also means VCF CSI meta contains TABIX meta (so we have meta inside a larger meta).
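For what it's worth, here is a rough sketch of what assembling that partial header by hand might look like. The field order follows the TBI header minus the magic and n_ref fields, as described above; the VCF preset/column values are my assumption of the usual tabix VCF configuration, not something stated in this thread.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <htslib/hts.h>

/* Write a 32-bit value little-endian, as it is stored on disk. */
static void put_i32(uint8_t *p, int32_t v)
{
    p[0] = v & 0xff;         p[1] = (v >> 8) & 0xff;
    p[2] = (v >> 16) & 0xff; p[3] = (v >> 24) & 0xff;
}

/* Sketch only: pack the seven int32 configuration fields (format,
 * col_seq, col_beg, col_end, meta char, skip, l_nm) followed by the
 * nul-terminated reference names, i.e. 28 + l_nm bytes in total.
 * The VCF values (preset 2, columns 1/2/0, '#', skip 0) are assumed. */
static int set_vcf_csi_meta(hts_idx_t *idx, const char **names, int n_names)
{
    uint32_t l_nm = 0, off = 28;
    uint8_t *meta;
    int i;

    for (i = 0; i < n_names; i++)
        l_nm += (uint32_t)strlen(names[i]) + 1;   /* every name keeps its '\0' */

    if (!(meta = malloc(28 + l_nm))) return -1;
    put_i32(meta +  0, 2);                /* format: VCF preset    */
    put_i32(meta +  4, 1);                /* col_seq               */
    put_i32(meta +  8, 2);                /* col_beg               */
    put_i32(meta + 12, 0);                /* col_end               */
    put_i32(meta + 16, '#');              /* meta character        */
    put_i32(meta + 20, 0);                /* lines to skip         */
    put_i32(meta + 24, (int32_t)l_nm);    /* length of names block */

    for (i = 0; i < n_names; i++) {
        size_t n = strlen(names[i]) + 1;  /* copy the terminating nul too */
        memcpy(meta + off, names[i], n);
        off += (uint32_t)n;
    }

    /* is_copy == 0: the index takes ownership of our malloc'd buffer */
    return hts_idx_set_meta(idx, 28 + l_nm, meta, 0);
}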

@pd3 @lh3 could one of you please file a PR to hts-specs and document this stuff in CSIv2.tex? It's extremely confusing and far from obvious. Also, please check that my interpretation is correct.

Your Nim code copies each chrom character by character and then increments offset by 1 without explicitly writing a nul. Does a new seq in Nim automatically zero its contents? If not, you'll have uninitialised data between the reference names.

It looks like you've correctly sized the array though, as 28 + the sum of the string lengths including each nul, and you are passing in the correct length too. So I don't see why it would fail, unless it's the uninitialised-data problem above.

brentp commented 5 years ago

Yes, a new seq does give zeroed memory.

jkbonfield commented 5 years ago

In that case I don't understand where the error lies, sorry. Maybe @pd3 who did CSI or @daviesrob who added that code can help.