open-reaction-database / ord-data

Official data repository for the Open Reaction Database
https://open-reaction-database.org
Creative Commons Attribution Share Alike 4.0 International
219 stars 55 forks source link

Migrate ord-data to ord-schema v0.3.0 #75

Closed skearnes closed 3 years ago

skearnes commented 3 years ago
skearnes commented 3 years ago

@connorcoley I'm going to request an exception to the usual checks here. There's a bug in process_dataset.py with how comparisons are made against gzipped datasets vs. the main branch that is causing the jobs to fail. But even if that bug weren't there, the fact that process_dataset.py validates each dataset twice (once on the way in and once on the way out) means that the job would never finish in the 6 h time limit.

The count_reactions job confirms that the number of reactions hasn't changed, and the (sharded) validate_database jobs are checking that everything in the update validates correctly, so I think we're safe to merge without the process_submission jobs.

connorcoley commented 3 years ago

@connorcoley I'm going to request an exception to the usual checks here. There's a bug in process_dataset.py with how comparisons are made against gzipped datasets vs. the main branch that is causing the jobs to fail. But even if that bug weren't there, the fact that process_dataset.py validates each dataset twice (once on the way in and once on the way out) means that the job would never finish in the 6 h time limit.

The count_reactions job confirms that the number of reactions hasn't changed, and the (sharded) validate_database jobs are checking that everything in the update validates correctly, so I think we're safe to merge without the process_submission jobs.

Since the number of reactions is the same and the validation is working, then I'm inclined to agree. I suppose this will be a problem any time we do a migration since process_dataset will far too long given the current size of the database?

skearnes commented 3 years ago

Since the number of reactions is the same and the validation is working, then I'm inclined to agree. I suppose this will be a problem any time we do a migration since process_dataset will far too long given the current size of the database?

Yep, but I think in situations like this where we're not expecting changes the validations should be sufficient.