thomas0809 / MolScribe

Robust Molecular Structure Recognition with Image-to-Graph Generation
MIT License

fixes #22 #23

Open eloyfelix opened 5 months ago

eloyfelix commented 5 months ago

Some images, for example US20230354702A1-20231102-C00260.TIF from the USPTO Grant Red Book (attached), make MolScribe hang for hours and use an unreasonable amount of RAM.

https://github.com/thomas0809/MolScribe/blob/97acee57d10bd719f4dc1cfd30d09f142b7dc65f/molscribe/chemistry.py#L200

shows:

[('L', 202)] 2020202020201 L 20202020202020201 L 2020202020201 L 20202020202020201

for this image. That is on the order of two trillion iterations (each one appending to a list), which in some cases makes bulk image processing hang and consumes an unreasonable amount of memory.

The fix is extremely simple: skip the processing of elements with more than 100000 atoms.
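As a rough illustration of this kind of guard (not the actual diff), here is a minimal sketch. It assumes the parsed formula is a list of `(symbol, count)` pairs. The names `expand_elements` and `MAX_ATOMS` are hypothetical and do not come from `molscribe/chemistry.py`. Checking the running total rather than each count individually is also an assumption.

```python
from typing import List, Optional, Tuple

# Threshold taken from the description above; the constant name is hypothetical.
MAX_ATOMS = 100_000


def expand_elements(
    pairs: List[Tuple[str, int]], max_atoms: int = MAX_ATOMS
) -> Optional[List[str]]:
    """Flatten (symbol, count) pairs into a list of symbols.

    Returns None, without allocating anything, when the expansion would
    contain more than `max_atoms` entries.
    """
    total = 0
    for _, count in pairs:
        total += count
        if total > max_atoms:
            # Implausibly large molecule: almost certainly a misread image.
            return None

    expanded: List[str] = []
    for symbol, count in pairs:
        expanded.extend([symbol] * count)
    return expanded


if __name__ == "__main__":
    print(expand_elements([("C", 2), ("H", 3)]))
    print(expand_elements([("L", 2020202020201)]))
```

The check runs before any list is built, so a pathological count such as the one shown above is rejected in constant time and memory. A caller could treat `None` as "could not expand" and move on to the next image.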

US20230354702A1-20231102-C00260.TIF.zip