Closed mfrager closed 9 years ago
Interesting. I wonder why I've never seen failures for this. Could you include a test case that demonstrates an overflow?
I'm definitely interested in fixing this, but might look into something a bit more nuanced, as this schema was designed to be compatible with redland, and so would break compatibility with both redland systems as well as existing RDF::Trine databases.
Here is an example that will generate a failure, just use http://testme.com/ as a URI and it will overflow. Here's the details:
Data(test) -> MD5Prefix(9,143,107,205,70,33,211,115) -> Int(8346051122425466633)
Data(LTest<>) -> MD5Prefix(68,86,105,112,246,2,125,56) -> Int(4070412895683958340)
Data(Rhttp://dev.atellix.com/v1/inst/45797812-DD52-11E4-AC58-1BB05FE580B2) -> MD5Prefix(41,59,122,184,72,180,160,45) -> Int(3287825952406125353)
Data(Rhttp://dev.atellix.com/v1/attr/rdf/prefix/prefix) -> MD5Prefix(196,145,157,115,35,178,105,40) -> Int(2912054499405042116)
Data(Rhttp://testme.com/) -> MD5Prefix(205,246,95,50,153,248,215,226) -> Int(16345806709423339213)
sqlite> .schema Statements8346051122425466633 CREATE TABLE Statements8346051122425466633 ( Subject NUMERIC(20) NOT NULL, Predicate NUMERIC(20) NOT NULL, Object NUMERIC(20) NOT NULL, Context NUMERIC(20) NOT NULL DEFAULT 0, PRIMARY KEY (Subject, Predicate, Object, Context) ); sqlite> select * from Statements8346051122425466633; 3287825952406125353|2982895206037061277|5176259570156409534|0 3287825952406125353|3108168581889151792|4070412895683958340|0 3287825952406125353|2912054499405042116|1.63458067094233e+19|0 sqlite> select typeof(Subject), typeof(Predicate), typeof(Object) from Statements8346051122425466633; integer|integer|integer integer|integer|integer integer|integer|real
I see. That is a problem. What would you think of just masking out the top bit of the hash values instead of using one fewer byte? Either way it's a breaking change, but masking the high-bit would allow non-overflowing hashes and table names to keep the same values, but allow all values to work with SQLite.
Masking the highest bit would probably be fine. There is enough variation in the MD5 checksums so that the chances of a collision would be extremely low. That would probably maintain compatibility with a lot of existing databases. However, a lot of those existing database might be silently failing or returning incomplete results.
In any case, having a fully functional RDF datastore including SPARQL support in a SQLite file is a powerful and exciting feature. Kudos!
OK. I'll go ahead and merge this and change the hash function to do the bit masking instead of dropping bytes. Thanks for raising this issue.
Sounds good! A couple things when you do that:
Thanks,
-Mike
If you want to give that new bugfix-sqlite-hash branch a look, I'll leave it there for a bit before merging into master. Looks good to me, though.
Tried it out... Everything looks good!
SQLite only support SIGNED 64-bin numbers and the hashing method "_mysql_hash" would sometimes create numbers that overflowed SQLite maximums and then get converted into floating-point representations that were not usable. The number of md5 characters used to generate the hash number has been reduced from 8 to 7 to make sure that the IDs generated never overflow SQLite's maximums.