harmonydata / harmony

The Harmony Python library: a research tool for psychologists to harmonise data and questionnaire items. Open source.
https://harmonydata.ac.uk
MIT License

Integrate new non-spaCy PDF parsing into main Harmony #39

woodthom2 closed this issue 1 month ago

woodthom2 commented 3 months ago

Description

We have a draft improvement to the PDF parsing logic. This will enable us to eliminate spaCy as a dependency.

The training code is here: https://github.com/harmonydata/pdf-text-models-amol

The API modification is here: https://github.com/harmonydata/harmonyapi (branch nospacy)

The modification to the main Python library is in this branch:

git clone -b updated_files_for_forntend https://github.com/Notysoty/harmony.git

Please quality-check this branch, merge it into main in all repositories, and remove spaCy from all requirements.txt and .toml files.
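To confirm that the dependency has actually been removed, one option is a small script that scans a checked-out repository for leftover references. This is only a sketch: the function name `find_spacy_references` and the file patterns (`requirements*.txt`, `*.toml`) are assumptions for illustration, not part of Harmony.

```python
import re
from pathlib import Path

# Matches "spacy" as a package name (also flags helpers like "spacy-legacy"),
# but not longer words that merely contain it, such as "notspacy".
SPACY_PATTERN = re.compile(r"(?<![\w-])spacy(?!\w)", re.IGNORECASE)


def find_spacy_references(repo_root):
    """Return (path, line_number, line) for each dependency line naming spaCy.

    Hypothetical helper: scans requirements*.txt and *.toml files under
    repo_root and skips lines that are pure comments.
    """
    root = Path(repo_root)
    candidates = sorted(
        list(root.rglob("requirements*.txt")) + list(root.rglob("*.toml"))
    )
    hits = []
    for path in candidates:
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if line.lstrip().startswith("#"):
                continue  # ignore commented-out lines
            if SPACY_PATTERN.search(line):
                hits.append((path, number, line.strip()))
    return hits


if __name__ == "__main__":
    for path, number, line in find_spacy_references("."):
        print(f"{path}:{number}: {line}")
```

Running it from the root of each repository should print nothing once spaCy has been removed everywhere.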

Rationale

PDF extraction needs improvement.

woodthom2 commented 1 month ago

Switched to sklearn-crfsuite.
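For readers unfamiliar with the library, here is a rough sketch of how token-level sequence labelling with sklearn-crfsuite is typically set up. The feature names, the hyperparameters and the helper functions (`token_features`, `sequence_features`, `train_crf`) are illustrative assumptions and are not taken from the Harmony code.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (hypothetical features)."""
    token = tokens[i]
    features = {
        "lower": token.lower(),
        "is_digit": token.isdigit(),
        "is_title": token.istitle(),
        "ends_with_question_mark": token.endswith("?"),
        "BOS": i == 0,                 # beginning of sequence
        "EOS": i == len(tokens) - 1,   # end of sequence
    }
    # Neighbouring tokens give the CRF some local context.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    return features


def sequence_features(tokens):
    """Convert one token sequence into the list-of-dicts format the CRF expects."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def train_crf(X, y):
    """Fit a CRF on X (list of feature sequences) and y (list of label sequences).

    The import is local so the feature helpers above work even when
    sklearn-crfsuite is not installed. Hyperparameters are placeholders.
    """
    import sklearn_crfsuite

    crf = sklearn_crfsuite.CRF(
        algorithm="lbfgs",
        c1=0.1,
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True,
    )
    crf.fit(X, y)
    return crf
```

With labelled data available, `train_crf([sequence_features(tokens)], [labels])` would return a fitted model whose `predict` method accepts feature sequences in the same format.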