USPTO images missing for the "small" train script

rytheranderson commented 6 months ago

Hello, I've recently been attempting to retrain MolScribe using the training scripts you provide in the scripts directory. First of all, thanks for providing all your data and training scripts; they are extremely helpful.

Second, when running train_uspto_joint_chartok.sh I get a series of missing image warnings like:

[ WARN:0@27.895] global loadsave.cpp:248 findDecoder imread_('data/uspto_mol/2002/20020723/US06423704-20020723/US06423704-20020723-C00135.TIF'): can't open/read file: check file path/integrity

I downloaded the ZIP from the link provided in the README (https://www.dropbox.com/s/3podz99nuwagudy/uspto_mol.zip?dl=0) and unzipped it into data. The warnings do not appear when running train_uspto_joint_chartok_1m680k.sh. The problem is that uspto_mol/train_200k.csv lists paths to images that are not included in the ZIP archive.
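For anyone who wants to confirm the mismatch before starting a training run, a check along these lines should list the CSV entries with no matching image on disk. This is only a minimal sketch: the column name `file_path` and the assumption that the CSV paths are relative to the data directory are guesses on my part, not something confirmed by the repository, and `find_missing_images` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def find_missing_images(csv_path, data_root, path_column="file_path"):
    """Return the image paths listed in the CSV that do not exist under data_root.

    path_column is an assumed column name; adjust it to match the actual CSV header.
    """
    missing = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            relative_path = row[path_column]
            # Assumption: paths in the CSV are relative to data_root.
            if not (Path(data_root) / relative_path).is_file():
                missing.append(relative_path)
    return missing


# Example call (paths follow the layout described in this issue):
# missing = find_missing_images("data/uspto_mol/train_200k.csv", "data")
# print(len(missing), "listed images are missing")
```

If the returned list is empty for the CSV used by train_uspto_joint_chartok_1m680k.sh but not for train_200k.csv, that would match the behaviour described above.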

It would be good to be able to run the smaller training set for quicker comparisons to your saved checkpoint. Let me know if this is fixable. Thanks for your time and this model!

thomas0809 commented 6 months ago

Sorry for the late reply. In our paper, we only keep the model trained with 1M synthetic data and 680K patent data. Therefore, only these data are released, and we encourage using them for future comparisons.

If you still want that 200K data, please send me an email at yujieq@csail.mit.edu. I can give you a link for private download.

rytheranderson commented 6 months ago

Thanks for the explanation; the 200k data isn't necessary for me. It may be good, however, to note in the 200k training script or the README that this data is not released.

thomas0809 / MolScribe, issue #13