Open brentp opened 5 years ago
By convention, the size of meta
is l_meta + 1
, with the last element set to '\0'
. Thus, the length is l_meta
, similar to strlen
behaviour.
However, there might be an issue with hts_idx_tbi_name
, which includes the terminator value in the new length.
You can pass meta
in as the size you desire, with a consistent l_meta
. You do not need to add allocate meta to be one byte longer than l_meta
with an extra nul byte as this will not be copied by hts_idx_set_meta
. The extra byte it uses is allocated by itself and written by itself. The idx->l_meta
variable is also copied from your supplied l_meta
as that is the correct size that should be written to disc.
You may therefore be wondering what the +1 in the above code is for. If we gave it 100 bytes, why does it copy 100 bytes and add 1 more nul? This code is used in common between CSI and TBI indices. Neither has a particularly pleasing specification, but TBI uses the meta data as a series of nul terminated reference names. That should include a nul on the last name too according to the TBI specification (thus meta should be e.g. "foo\0bar\0" with l_meta of 8).
However I assume the +1 is purely defensive coding for a case where hts_idx_set_meta
is called with a series of concatenated nul-terminated strings where the last one lacks the termination nul (e.g. "foo\0bar" and l_meta of 7). That's invalid TBI format, but elsewhere in tbx_index_load2
we decode the TBI meta data using strlen and the lack of a nul termination byte would lead to possible crashes, hence it was defensively added just incase.
This is an internal nuance that can safely be ignored. Indeed you definitely shouldn't be manually adding a '\0' to meta and updating your l_meta
before calling hts_idx_set_meta
for CSI as the meta data is a bgzf block (I think - the CSI spec is sadly totally silent on the this front).
thanks for the explanations. I checked samtools/bcftools and htslib and there is no code that uses nonzero value is_copy
.
when I use that in my code with the expected lengths, including trailing \0
, of course, it errors. I have adjusted the code to simply use is_copy 0 (with all else identical) and it avoids this error. I'm not sure how I would adjust the length or the meta
to when using a copy. When I do something that avoids the memory error, I get an extra, empty chromosome name of in the index.
FWIW, here is the (nim) code that now does not do a copy in question. AFAICT it's not doing anything odd. https://github.com/brentp/hts-nim/blob/c34614ce1e929e74b79756de8e1a0a7efbdddd90/src/hts/csi.nim#L60-L85
maybe I misread your prescription. do I pass in meta without the final trailing '\0' and the appropriate l_meta?
It looks like you're doing CSI indices. I was wrong earlier in thinking CSI meta is a bgzf block - it's written in htslib with bgzf_write but only because the rest of the index is also bgzf. So possibly my previous comment was wrong.
I wish the CSI spec actually contained an explanation of what the format really is. If I recall there are also hybrid indices lurking somewhere - either CSI index with a TBI meta-data block or vice versa! Possibly that was BAM CSI vs VCF CSI or something? (Thanfully!) I forget now, but I got thoroughly lost while doing the auto-indexing changes.
You're in a twisty maze of indices all alike!
I'll unpick the code again and try to figure out what the CSI format is.
PS. I wondered if anyone was working on Nim bindings. Good to see. :-)
Ok so it's making a bit more sense. I believe this is how it works.
On BCF and BAM, meta is blank in CSI.
On VCF, CSI meta is a partial TABIX header struct (missing the magic
and n_ref
fields). This also means VCF CSI meta contains TABIX meta (so we have meta inside a larger meta).
@pd3 @lh3 could one of you please file a PR to hts-specs and document this stuff in CSIv2.tex? It's extremely confusing and very unobvious. Also please check my interpretation is correct.
Your nim code copies each chrom character by character and then increments offset by 1 without explicitly copying in a nul. Does the new seq in nim automatically zero the contents? If not then you'll have uninitialised data between each reference name.
It looks like you've correctly allocated the array size though to be 28 + sum of string lengths including each nul, and are passing in the length correct too. So I don't see why it would fail unless it's the uninitialised problem above.
new_seq
does give zero'ed memory.
In that case I don't understand where the error lies, sorry. Maybe @pd3 who did CSI or @daviesrob who added that code can help.
Perhaps this is just a documentation issue, but I am having problems (memory issues) with
hts_idx_set_meta
. Here is the source:Note that when
is_copy == 1
, the length ofnew_meta
isl_meta + 1
, butl_meta
is not updated. I can get around this by passing an extra byte (in addition to the final trailing\0
for the last chromosome in by csi index), but I think it would be better to document that the* meta
passed in should contain the trailing\0
and then havenew_meta
of lengthl_meta
rather than adding a byte.