huridocs / pdf_paragraphs_extraction

MIT License

pdf_features and a few other libraries are not imported #118

Open asleroid opened 3 months ago

asleroid commented 3 months ago

Even though pdf_features is in the installed libraries within venv, running 'pip list' does not return the library.

As a result, when running the following command, the script errors out:

(venv) asleroid@Aslis-MBP pdf_paragraphs_extraction % python src/create_paragraph_extractor_model.py /Users/asleroid/Code/pdf-labeled-data/labeled_data/paragraph_extraction
loading one_column_test from /Users/asleroid/Code/pdf-labeled-data/labeled_data/paragraph_extraction/one_column_test
Traceback (most recent call last):
  File "/Users/asleroid/Code/pdf_paragraphs_extraction/src/create_paragraph_extractor_model.py", line 25, in <module>
    train_model()
  File "/Users/asleroid/Code/pdf_paragraphs_extraction/src/create_paragraph_extractor_model.py", line 12, in train_model
    pdf_paragraph_tokens_list = load_labeled_data(PDF_LABELED_DATA_ROOT_PATH)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/asleroid/Code/pdf_paragraphs_extraction/src/paragraph_extraction_trainer/load_labeled_data.py", line 34, in load_labeled_data
    pdf_paragraph_tokens = PdfParagraphTokens.from_labeled_data(pdf_labeled_data_root_path, dataset, pdf_name)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/asleroid/Code/pdf_paragraphs_extraction/src/paragraph_extraction_trainer/PdfParagraphTokens.py", line 29, in from_labeled_data
    pdf_features = PdfFeatures.from_labeled_data(pdf_labeled_data_root_path, dataset, pdf_name)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/asleroid/Code/pdf_paragraphs_extraction/venv/lib/python3.11/site-packages/pdf_features/PdfFeatures.py", line 126, in from_labeled_data
    pdf_features.set_token_types(token_type_labels)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'set_token_types'

gabriel-piles commented 3 months ago

Thank you for reaching out.

The PdfFeatures class is part of the pdf-tokens-type-labeler package. You can install this package with the following command:

pip install git+https://github.com/huridocs/pdf-tokens-type-labeler@1c12c368887372164ab4981c3277a49e9dc43b9a

Let us know if this solves your problem.