kasei / perlrdf

Deprecated in favor of the Attean package
26 stars 25 forks source link

Fixes for RDF::Trine::Store::DBI::SQLite #126

Closed mfrager closed 9 years ago

mfrager commented 9 years ago

SQLite only support SIGNED 64-bin numbers and the hashing method "_mysql_hash" would sometimes create numbers that overflowed SQLite maximums and then get converted into floating-point representations that were not usable. The number of md5 characters used to generate the hash number has been reduced from 8 to 7 to make sure that the IDs generated never overflow SQLite's maximums.

kasei commented 9 years ago

Interesting. I wonder why I've never seen failures for this. Could you include a test case that demonstrates an overflow?

I'm definitely interested in fixing this, but might look into something a bit more nuanced, as this schema was designed to be compatible with redland, and so would break compatibility with both redland systems as well as existing RDF::Trine databases.

mfrager commented 9 years ago

Here is an example that will generate a failure, just use http://testme.com/ as a URI and it will overflow. Here's the details:

Data(test) -> MD5Prefix(9,143,107,205,70,33,211,115) -> Int(8346051122425466633)

Data(LTest<>) -> MD5Prefix(68,86,105,112,246,2,125,56) -> Int(4070412895683958340)

Data(Rhttp://dev.atellix.com/v1/inst/45797812-DD52-11E4-AC58-1BB05FE580B2) -> MD5Prefix(41,59,122,184,72,180,160,45) -> Int(3287825952406125353)

Data(Rhttp://dev.atellix.com/v1/attr/rdf/prefix/prefix) -> MD5Prefix(196,145,157,115,35,178,105,40) -> Int(2912054499405042116)

Data(Rhttp://testme.com/) -> MD5Prefix(205,246,95,50,153,248,215,226) -> Int(16345806709423339213)

sqlite> .schema Statements8346051122425466633 CREATE TABLE Statements8346051122425466633 ( Subject NUMERIC(20) NOT NULL, Predicate NUMERIC(20) NOT NULL, Object NUMERIC(20) NOT NULL, Context NUMERIC(20) NOT NULL DEFAULT 0, PRIMARY KEY (Subject, Predicate, Object, Context) ); sqlite> select * from Statements8346051122425466633; 3287825952406125353|2982895206037061277|5176259570156409534|0 3287825952406125353|3108168581889151792|4070412895683958340|0 3287825952406125353|2912054499405042116|1.63458067094233e+19|0 sqlite> select typeof(Subject), typeof(Predicate), typeof(Object) from Statements8346051122425466633; integer|integer|integer integer|integer|integer integer|integer|real

kasei commented 9 years ago

I see. That is a problem. What would you think of just masking out the top bit of the hash values instead of using one fewer byte? Either way it's a breaking change, but masking the high-bit would allow non-overflowing hashes and table names to keep the same values, but allow all values to work with SQLite.

mfrager commented 9 years ago

Masking the highest bit would probably be fine. There is enough variation in the MD5 checksums so that the chances of a collision would be extremely low. That would probably maintain compatibility with a lot of existing databases. However, a lot of those existing database might be silently failing or returning incomplete results.

In any case, having a fully functional RDF datastore including SPARQL support in a SQLite file is a powerful and exciting feature. Kudos!

kasei commented 9 years ago

OK. I'll go ahead and merge this and change the hash function to do the bit masking instead of dropping bytes. Thanks for raising this issue.

mfrager commented 9 years ago

Sounds good! A couple things when you do that:

  1. The unit test cases can be reverted back to what they were before.
  2. The "_init" I added method in RDF::Trine::Store::DBI::SQLite with a different SQL schema will probably not be necessary.

Thanks,

-Mike

kasei commented 9 years ago

If you want to give that new bugfix-sqlite-hash branch a look, I'll leave it there for a bit before merging into master. Looks good to me, though.

mfrager commented 9 years ago

Tried it out... Everything looks good!