libAtoms / abcd


Duplicate uploads #35

Open gabor1 opened 5 years ago

gabor1 commented 5 years ago

Do we want to prevent or at least warn about duplicate uploads? E.g. check some hash, and if it's really the same, then at least warn the user before committing the upload?

gabor1 commented 5 years ago

We can defer to this when we have thought more about hashing, which is tied into caching

fekad commented 4 years ago

Decision:

fekad commented 4 years ago

Is it OK to use the hash of the string representation of a float? What should the precision of the truncated float be? Or is there a more efficient way to do the truncation?

@gabor1 Do you want a structure-specific hash value (based only on positions, cell and pbc) as well?
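
One way to make that concrete; the fixed-point formatting and the 8-decimal precision here are illustrative choices, not settled answers:

def float_key(x: float, precision: int = 8) -> str:
    """Fixed-point string form of a float, used as the hash input."""
    return f'{x:.{precision}f}'

# Two floats that agree to 8 decimal places map to the same key:
assert float_key(3.141592653589793) == float_key(3.1415926535)  # '3.14159265'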

gabor1 commented 4 years ago

I think both hashes are interesting: one that is just position/cell/pbc based (remember that many input xyz files don't have the pbc key set), and one that incorporates everything. In the upload functionality, it would be nice to have options that warn me if the structure is the same, or if the entire config is the same, and to take various actions.

gabor1 commented 4 years ago

The corresponding derived quantities could be "hash_structure" and "hash". The upload options that I think would be useful: --warn-duplicate, --warn-duplicate-structure, --ignore-duplicate. There should also be a way of calling the upload so that it is not actually executed but the warnings are still given (something like a --dry option?).

fekad commented 4 years ago

Unfortunately, Python (for security reasons) uses a random seed for the built-in hash function, so the hash value of a string will differ between two Python processes:

print(hash(1), hash((1,2)), hash(3.14), hash('hello'))
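
For completeness (an aside, not necessarily the right fix here): the salt only affects str/bytes hashing, and it can be pinned with the PYTHONHASHSEED environment variable, as this self-contained check shows:

import os
import subprocess
import sys

# Two separate interpreters normally disagree on hash('hello');
# with PYTHONHASHSEED pinned, they agree.
cmd = [sys.executable, '-c', "print(hash('hello'))"]
env = {**os.environ, 'PYTHONHASHSEED': '0'}
print(subprocess.check_output(cmd, env=env).strip())  # run 1
print(subprocess.check_output(cmd, env=env).strip())  # run 2: identical
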
gabor1 commented 4 years ago

um… what’s the point of a hash if it gives different values for the same thing… there must be a way to use it so that it gives the same value


gabor1 commented 4 years ago

maybe you want an md5 hash?


gabor1 commented 4 years ago

https://docs.python.org/2/library/hashlib.html#module-hashlib

fekad commented 4 years ago

I was thinking about md5 as well, but I hoped there was a more elegant solution. We should not need a cryptographic hash like the ones in hashlib; in our case it is fine if two close numbers have close hash values...

Python uses int64 numbers for the hash values:

In [3]: type(hash('adsgkjdskgljsdlkjdg4'))
Out[3]: int

In [4]: hash(1), type(hash(1))
Out[4]: (1, int)

In [5]: hashlib.md5(bytes(1)).hexdigest()
Out[5]: '93b885adfe0da089cdf634904fd59f71'

It looks simple and nice, and a 64-bit hash has almost as many possible values as YouTube uses for all the world's video IDs.

gabor1 commented 4 years ago

I agree it doesn't need to be cryptographic. But didn't you say that the hash() function is not deterministic?


fekad commented 4 years ago

As far as I know, it is non-deterministic only for strings; integers and floats are OK. Here you can see the previously mentioned example:

In [8]: print(hash(1), hash((1,2)), hash(3.14), hash('hello'))
   ...:
1 3713081631934410656 322818021289917443 -8637518414175570336

In [9]:
Do you really want to exit ([y]/n)? ^D
$ ipython
Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 14:38:56)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: print(hash(1), hash((1,2)), hash(3.14), hash('hello'))
1 3713081631934410656 322818021289917443 -6054905812727825455
gabor1 commented 4 years ago

I found this:

https://stackoverflow.com/questions/27954892/deterministic-hashing-in-python-3

They say to use a different hash... maybe there are cheaper ones than md5.
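
For reference, the standard library does ship cheap, deterministic checksums; a small comparison (zlib's are 32-bit, which matters for the collision point raised below):

import hashlib
import zlib

data = b'canonical byte representation of a config'
print(zlib.adler32(data))             # 32-bit checksum, very fast
print(zlib.crc32(data))               # 32-bit checksum, deterministic
print(hashlib.md5(data).hexdigest())  # 128-bit digest, slower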


fekad commented 4 years ago

zlib.adler32() looks interesting, but 32 bits is not enough: you can easily get collisions. At this scale, there is a huge difference between 32- and 64-bit hashes (see the birthday-bound sketch below).
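
To put numbers on that: by the birthday bound, collisions become likely once you store around 2^(n/2) items under an n-bit hash. A quick check (the item count is illustrative):

import math

def p_collision(items: int, bits: int) -> float:
    """Approximate probability of any collision among `items` values
    drawn uniformly from a `bits`-bit hash (birthday approximation)."""
    return 1 - math.exp(-items * (items - 1) / (2 * 2**bits))

print(f'{p_collision(100_000, 32):.0%}')   # ~69% for a 32-bit hash
print(f'{p_collision(100_000, 64):.1e}')   # ~2.7e-10 for a 64-bit hash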

Did I miss something, or is the rest of that thread just the same pointer to Python's hashlib library?

If I remember right, MD5 is 128 bits long, but we must convert everything into byte arrays first...

Or there may be a completely different valid solution, like generating a JSON text stream and hashing that with md5 (of course, the keys must be sorted...), as in the sketch below.
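
A minimal sketch of that JSON route, assuming rounding to 8 decimals (an illustrative precision, not a project decision):

import hashlib
import json

def config_hash(data: dict, precision: int = 8) -> str:
    """MD5 of the canonical JSON form of a config.

    Floats are rounded so numerically identical structures produce
    identical digests; sort_keys fixes the key order.
    """
    def normalise(v):
        if isinstance(v, float):
            return round(v, precision)
        if isinstance(v, (list, tuple)):
            return [normalise(x) for x in v]
        if isinstance(v, dict):
            return {k: normalise(x) for k, x in v.items()}
        return v

    canonical = json.dumps(normalise(data), sort_keys=True)
    return hashlib.md5(canonical.encode('utf-8')).hexdigest()

# Same data, different key order and a sub-precision difference:
a = {'cell': [1.0, 1.0, 1.0], 'pbc': [True, True, True]}
b = {'pbc': [True, True, True], 'cell': [1.00000000001, 1.0, 1.0]}
assert config_hash(a) == config_hash(b)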

fekad commented 4 years ago

This might be interesting: https://amp.readthedocs.io/en/latest/useamp.html#advanced-use

gabor1 commented 4 years ago

maybe stick with md5 then


fekad commented 4 years ago

The current version of hashing is implemented with MD5. For the future, I recommend switching to a non-cryptographic hash for better performance, if the issue with strings can be solved.

About the command options (--warn-duplicate, --warn-duplicate-structure, --ignore-duplicate, --dry): what should the default behaviour of the upload command be? Depending on the flags, the upload command would raise an error when there are structural duplicates OR when the full hash is the same.

gabor1 commented 4 years ago

How hard would it be to count the duplicates all the time and always report them (separately reporting the number of structural duplicates and of complete duplicates), and by default ask a y/n question whether to proceed with the upload? This would mean no --warn* options; but if the --drop-duplicates option is given, then duplicates are NOT uploaded, and if the --upload-duplicates option is explicitly given, then they are uploaded.

I guess if prompting the user is hard, then we can just quit, and if the above options are given, then we execute accordingly.

fekad commented 4 years ago

It is not harder than the others. Of course, you pay a performance penalty of roughly a factor of 3: you need to iterate/convert everything twice and check the hashes one by one against the whole database. The hashes would need to be cached in the DB. I do not have much experience in that field, but maybe you know somebody...

gabor1 commented 4 years ago

OK, how about this: let's not do questions etc. Default behaviour: check for duplicates, and if there are none, then do the upload (second pass on the file). If there are, then report "X duplicate structures found (different metadata). Y duplicate configurations found (including metadata)" and exit without uploading anything. Then the following options can be given:

  • if --upload-structure-duplicates is given, then report both types of duplicates, but do not insert EXACT duplicates into the database (just drop them), but still insert structural duplicates that have different metadata

  • if --upload-duplicates-replace is given, then upload everything and duplicates overwrite previously existing data (this is useful for correcting errors), so the resulting database should have no duplicates (neither structural nor exact) if it didn't have any before

let me know if you see a problem with this plan!
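
A minimal, self-contained sketch of this plan (the dict-backed db, the field names in STRUCTURE_KEYS, and the keyword arguments are illustrative assumptions, not the abcd API):

import hashlib
import json

STRUCTURE_KEYS = ('positions', 'cell', 'pbc')  # assumed field names

def md5_of(obj) -> str:
    """Deterministic digest of a JSON-serialisable object (sorted keys)."""
    return hashlib.md5(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def structure_hash(config: dict) -> str:
    return md5_of({k: config[k] for k in STRUCTURE_KEYS})

def upload(configs, db, structure_duplicates=False, replace=False):
    """Two-pass upload; db is modelled as a dict of full hash -> config."""
    by_structure = {}
    for h, c in db.items():
        by_structure.setdefault(structure_hash(c), []).append(h)

    # Pass 1: count duplicates against the database.
    exact = sum(md5_of(c) in db for c in configs)
    structural = sum(structure_hash(c) in by_structure for c in configs) - exact

    if (exact or structural) and not (structure_duplicates or replace):
        print(f'{structural} duplicate structures found (different metadata). '
              f'{exact} duplicate configurations found (including metadata)')
        return  # default: exit without uploading anything

    # Pass 2: insert according to the chosen policy.
    for c in configs:
        if structure_duplicates and md5_of(c) in db:
            continue  # drop EXACT duplicates, keep structural ones
        if replace:  # remove every item sharing this structure first
            for h in by_structure.pop(structure_hash(c), []):
                del db[h]
        db[md5_of(c)] = c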

fekad commented 4 years ago

It looks OK to me. Just to make sure: --upload-duplicates-replace basically means that we remove all the items with a matching hash_structure and add the new one, doesn't it? The new one will get its own new unique id.

gabor1 commented 4 years ago

yes


fekad commented 4 years ago

Just to make sure:

  • if --upload-structure-duplicates is given, then report both types of duplicates, but do not insert EXACT duplicates into the database (just drop them), but still insert structural duplicates that have different metadata

In this case, if there is an EXACT duplicate, the derived properties (username, upload and modified dates, etc.) will keep their old values: dropping != re-uploading.

  • if --upload-duplicates-replace is given, then upload everything and duplicates overwrite previously existing data (this is useful for correcting errors), so the resulting database should have no duplicates (neither structural nor exact) if it didn't have any before

So if the database already contained some structural duplicates, all of them will be deleted before uploading the new one; e.g. 10 calculations with the same structure but different "metadata" will be replaced by the single new one.

What should happen if the file that you want to upload contains more than one calculation with the same structure, or even multiple copies of the same calculation? If they have not been stored in the DB before, the check will miss them. (It is another factor of N to check them properly.)
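
(An aside: keeping a set of already-seen hashes makes the in-file check a single extra pass; key here would be whichever hash function, full or structural, applies:)

def dedupe(configs, key):
    """Yield configs in order, dropping later ones whose hash repeats."""
    seen = set()
    for c in configs:
        h = key(c)
        if h not in seen:
            seen.add(h)
            yield c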

gabor1 commented 4 years ago

Good point. I think we just upload the new file, and if it has duplicates, fine, they will go into the database.
