tomassosorio / OCR_tablenet

TableNet Implementation on Pytorch

Import error #14

Closed: LivingDeadCloud closed this issue 3 years ago

LivingDeadCloud commented 3 years ago

Hi there @tomassosorio

Quick introduction: I need to extract data from PDFs/images containing tables. Unfortunately, I have several different formats, and traditional tools (PDFPlumber, Tabula, Camelot) do not seem to work for every possible format. So now I'm trying a DL approach, and while looking for a TableNet implementation I found this repo.

I'm trying to use your code on Google Colab, but unfortunately I was not able to make it work. Note that I have very little experience with DL libraries, so I apologise if my question is trivial.

Anyway, here's my code:

# Mount drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

!pip install -r  /content/drive/MyDrive/TableNet/requirements.txt

!python /content/drive/MyDrive/TableNet/predict.py --model_weights='/content/drive/MyDrive/TableNet/best_model.ckpt' --image_path='/content/drive/MyDrive/TableNet/TablesImages/Test_table.png'

This is the error I get:

Traceback (most recent call last):
  File "/content/drive/MyDrive/TableNet/predict.py", line 19, in <module>
    from tablenet import TableNetModule
  File "/content/drive/MyDrive/TableNet/tablenet/__init__.py", line 3, in <module>
    from .marmot import MarmotDataModule
  File "/content/drive/MyDrive/TableNet/tablenet/marmot.py", line 7, in <module>
    import pytorch_lightning as pl
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/__init__.py", line 66, in <module>
    from pytorch_lightning import metrics
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.metric import Metric
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/metric.py", line 23, in <module>
    from pytorch_lightning.metrics.utils import _flatten, dim_zero_cat, dim_zero_mean, dim_zero_sum
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/utils.py", line 18, in <module>
    from pytorch_lightning.utilities import rank_zero_warn
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/__init__.py", line 24, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py", line 25, in <module>
    from torchtext.data import Batch
  File "/usr/local/lib/python3.7/dist-packages/torchtext/__init__.py", line 6, in <module>
    from . import experimental
  File "/usr/local/lib/python3.7/dist-packages/torchtext/experimental/__init__.py", line 2, in <module>
    from . import transforms
  File "/usr/local/lib/python3.7/dist-packages/torchtext/experimental/transforms.py", line 4, in <module>
    from torchtext._torchtext import RegexTokenizer as RegexTokenizerPybind
ImportError: /usr/local/lib/python3.7/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZNK3c104Type14isSubtypeOfExtERKSt10shared_ptrIS0_EPSo

I have to admit I have no idea what is causing the error. Could you please help me?

Thanks a lot and great work!
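
Note on the traceback above: the undefined symbol appears to demangle to a const method of `c10::Type`, and `c10` is part of PyTorch's core library. An error like this usually suggests that the installed torchtext binary was built against a different torch release than the one present in the environment. A minimal sketch for checking which versions are installed is shown below. The helper name `installed_versions` is hypothetical, and `importlib.metadata` requires Python 3.8 or newer (on the Python 3.7 runtime shown in the traceback, the `importlib_metadata` backport would be needed instead).

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages that appear in the traceback; adjust the list as needed.
    report = installed_versions(["torch", "torchtext", "pytorch-lightning"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

If the reported torch and torchtext versions do not belong to matching releases, pinning both packages to a compatible pair is one way to resolve the mismatch.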

LivingDeadCloud commented 3 years ago

Ok, after some debugging I was able to run the code. If you want, I could post my code here, so that users that will have to use Google Colab have it ready.

Let me know!

Cheers

tomassosorio commented 3 years ago

@LivingDeadCloud, thanks! If you don't mind I would appreciate it!

Sorry, I did not have time to help you in between... :/

LivingDeadCloud commented 3 years ago

Yeah no problem, I will post it next week!

LivingDeadCloud commented 3 years ago

Hey everyone

Sorry for the delay. I'm going to post the code now. Just note that I haven't used the code since my original post, so it may need some small adjustments. Here's my code to run TableNet in Google Colab:

# Mount drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Install all requirements
!pip install -r /content/drive/MyDrive/TableNet/requirements.txt # Change this with the path to requirements.txt

# Install additional packages
!pip install tesseract
!pip install torchtext==0.8.0
!pip install torch==1.7.1 
!pip install pytorch-lightning==1.2.2
!pip install torchmetrics
!pip install deprecate
!apt install tesseract-ocr
!apt install libtesseract-dev

# Run the code
# python predict.py --model_weights='<weights path>' --image_path='<image path>' # Default command line
result = !python /content/drive/MyDrive/TableNet/predict.py --model_weights='/content/drive/MyDrive/TableNet/best_model.ckpt' --image_path='/content/drive/MyDrive/TableNet/TablesImages/Your_image.png' # Change paths to "predict.py" and "Your_image.png" according to your drive

# Look at the result
result

Now, result is an IPython.utils.text.SList variable, so you may need to adjust predict.py to get the output in a more convenient form. However, it should be pretty straightforward from here. If someone is willing to post their code to get result as a more useful type of variable, for example a Pandas DataFrame, that would be great!
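
As a starting point for that, here is a minimal sketch of turning captured output lines into rows of cells. It relies only on the fact that an `SList` is a list of strings, one per output line. The actual output format of predict.py is not shown in this thread, so the column separator (two or more whitespace characters), the helper name `lines_to_rows`, and the sample data are all assumptions for illustration.

```python
import re


def lines_to_rows(lines, sep=r"\s{2,}"):
    """Split captured output lines into rows of cells, padded to equal width.

    `sep` is an assumed column separator (two or more whitespace characters);
    change it to match what predict.py actually prints.
    """
    # Skip blank lines and split every remaining line into cells.
    rows = [re.split(sep, line.strip()) for line in lines if line.strip()]
    # Pad short rows with empty strings so every row has the same length.
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


# Hypothetical stand-in for `result`, which behaves like a list of strings.
sample = ["Name   Qty   Price", "", "Apple   3   1.20", "Pear   10"]
rows = lines_to_rows(sample)
print(rows)
```

In Colab, `pd.DataFrame(rows[1:], columns=rows[0])` would then build a DataFrame from these rows, assuming pandas is imported as `pd` and the first line holds the column headers.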

Cheers

tomassosorio commented 3 years ago

@LivingDeadCloud Thanks! I will add this to README.md