
Collage

Code for the Collage Tool, a part of the HT-MAX project.

Collage is a tool designed for rapid prototyping, visualization, and evaluation of different information extraction models on scientific PDFs. Further, we enable both non-technical users and NLP practitioners to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing.

The demo should be available and running at this URL. The server can sometimes be unstable; if you have trouble accessing it, please follow the Docker Compose instructions below.

Setup/Running the Demo

For convenience, we've Dockerized all of the components of this system. To get started with the demo, simply run the following from the root directory of the repo:

docker compose up

On our machines, this takes ~20 minutes to complete, largely because ChemDataExtractor has to download a number of models. If you do not need ChemDataExtractor, or want to speed up the build process significantly, comment out the chemdataextractor service in compose.yaml. This sets up a Docker Compose network with three containers: the interface, an instance of GROBID (used to get reading-order sections), and the ChemDataExtractor service.
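If you prefer not to edit compose.yaml, Docker Compose can also start only the services you name. The service names other than chemdataextractor below are assumptions about how compose.yaml is organized, so treat this as a sketch:

docker compose up interface grobid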

Alternatively, you can run the interface and Grobid separately. To build the interface Docker image, run the following from the repo root:

docker build -t collage_interface .

And run the Grobid image with the command from their documentation:

docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.0

(Note that this command requires nvidia-docker. Grobid runs fine without it; simply remove --gpus all from the command.)

To configure the application, modify the config in app_config.py. This allows you to specify the Grobid and ChemDataExtractor URLs, as well as API keys for LLM services or MathPix.
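As a rough sketch of what that configuration might contain (the variable names and values below are assumptions, not the actual names in app_config.py):

# Sketch of app_config.py settings; the actual names in the repo may differ.
GROBID_URL = "http://localhost:8070"             # where the GROBID container is reachable
CHEMDATAEXTRACTOR_URL = "http://localhost:8000"  # ChemDataExtractor service endpoint (port is an assumption)
OPENAI_API_KEY = "sk-..."                        # API key for an LLM service, if used
MATHPIX_APP_ID = "your-app-id"                   # MathPix credentials, if used
MATHPIX_APP_KEY = "your-app-key"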

What's in this repo?

Collage has three primary components:

Extending Collage by implementing interfaces

This repo contains the interfaces discussed above, along with several implementations of those interfaces. These implementations provide the blueprint for how to implement the interfaces in a number of different ways: in-memory implementations that run right in the pipeline; small, Dockerized services for components with complicated environment requirements that may not be compatible with Collage; and a few that use external APIs. We outline these components, and how they implement their interfaces, below. Note that because Collage is a prototyping tool, it does not aim for efficiency: all models run on CPU. At the level of a single paper, which is what the interface supports, the pipeline takes around a minute to annotate the paper.

Each interface requires users to specify the following:

Finally, for a new implementation to be visualized in the frontend, it must be registered in local_model_config.py by adding a new LocalModelInfo object to the MODEL_LIST object. This object contains a model name, a description, and a zero-argument function that returns an instance of the predictor. New parameters to be passed to predictor constructors should be declared in app_config.py.
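For example, registering a new predictor might look roughly like this (the exact LocalModelInfo field names and the predictor class are hypothetical, inferred from the description above):

# local_model_config.py (sketch; LocalModelInfo's actual field names may differ)
MODEL_LIST.append(
    LocalModelInfo(
        name="my_custom_ner",                          # model name shown in the frontend
        description="Regex-based tagger for alloy names",  # short description
        get_predictor=lambda: MyCustomNERPredictor(),  # zero-argument factory returning the predictor (hypothetical class)
    )
)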

Token Classification Interface - TokenClassificationPredictorABC:

This interface is intended for any model that produces annotations of spans in text, i.e. most "classical" NER or event extraction models. Users are required to override the following methods:
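As a purely illustrative sketch of what an implementation might look like (the method name and return shape below are assumptions; the actual abstract methods are defined on TokenClassificationPredictorABC in this repo):

# Hypothetical sketch only; consult TokenClassificationPredictorABC for the real abstract methods.
import re

class RegexAlloyPredictor(TokenClassificationPredictorABC):
    def tag_entities(self, passages: list[str]) -> list[list[dict]]:
        # Assumed contract: for each passage, return labeled spans with character offsets.
        results = []
        for passage in passages:
            spans = []
            for match in re.finditer(r"IN\d{3}", passage):  # toy pattern for Inconel alloy names
                spans.append({"start": match.start(), "end": match.end(), "label": "ALLOY"})
            results.append(spans)
        return results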

Current implementations:

Text Prediction Interface - TextGenerationPredictorABC:

Given the prominence of large language model-based approaches, this interface is designed to allow for text-to-text prediction. This interface can be extended by:
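As an illustrative sketch (the method name and helper below are hypothetical; the actual abstract methods are defined on TextGenerationPredictorABC):

# Hypothetical sketch only; consult TextGenerationPredictorABC for the real abstract methods.
class SynthesisConditionExtractor(TextGenerationPredictorABC):
    def generate(self, passage_text: str) -> str:
        # Assumed contract: map a passage of the paper to generated text,
        # e.g. by prompting a hosted LLM and returning its completion.
        prompt = f"List the synthesis conditions mentioned in this passage:\n{passage_text}"
        return call_llm(prompt)  # call_llm is a hypothetical wrapper around an LLM API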

Current implementation:

Image Prediction Interface - ImagePredictorABC:

Given the focus on tables and charts that many of our interview participants discussed, and the fact that table parsing is an active research area, we additionally provide an interface for models that parse images, ImagePredictorABC, in order to handle multimodal processing, including tables. Predictors that implement this interface return an output in the form of an ImagePredictionResult, a union type that allows users to return any combination of a raw prediction string, a dict that represents a table, a list of bounding boxes, or a predicted string. All of these representations, if present, are rendered in the frontend view.
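To give a rough sense of the shape of that result type (the field names and values here are assumptions, inferred only from the description above):

# Hypothetical sketch; ImagePredictionResult's actual field names may differ.
result = ImagePredictionResult(
    raw_prediction="<table><tr><td>IN718</td><td>650 C</td></tr></table>",  # raw model output
    predicted_table={"Alloy": ["IN718"], "Test temperature (C)": [650]},    # dict representing a table
    bounding_boxes=[(0.12, 0.30, 0.45, 0.08)],                              # detected regions on the image
)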

This interface gives users two options for which method to override:

Current implementations:

Other implemented components:

Scripts

This repo contains the following scripts:

parse_papers_to_json.py: This script parses the content of PDFs into structured JSON representations. Currently, it runs the MaterialsRecipe on a specified folder of papers and dumps the JSON representations to the specified output folder (see the example invocation below).
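A typical invocation might look something like this (the argument order is hypothetical; check the script itself for its actual command-line interface):

python parse_papers_to_json.py path/to/pdf_folder path/to/output_folder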

Notebooks

To aid development, this repo contains two notebooks that facilitate quicker development of PaperMage predictors. dev_run_recipe_and_serialize.ipynb takes a new PDF, runs the MaterialsRecipe on it, and serializes the result. The second notebook opens a paper from the parsed json and allows further manipulation.

[CMU Collaborators] Getting and using data

The testing data for this project is managed and versioned by DVC, and it is stored in this Google Drive folder. Data and checkpoints should be stored in the data/ folder. For this project, we are symlinking in the PDF data that we store in the NLP Collaboration Box Folder, e.g.:

ln -s $BOX_SYNC_FOLDER/NLP-collaboration-folder/AM_Creep_Papers data/AM_Creep_Papers

Data derived from those PDFs, model checkpoints, etc. will be stored in the data/ folder and managed with DVC.

You can find instructions for installing DVC here. Once you have DVC installed, run dvc pull from the root of the repo to pull down all the files that have been checked into DVC thus far. You will be asked to give DVC permission to access the files in your Google Drive; proceed with your CMU account.

DVC works in a similar fashion to git-lfs: it stores pointers and metadata for your data in the git repository, while the files live elsewhere (in this case, on Google Drive). As you work with data, such as in the DVC tutorial, DVC will automatically add the files you have tracked with it to the .gitignore file, and add new .dvc files that track the metadata associated with those files.
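As an illustration, tracking a new data file follows the standard DVC pattern (the file path below is hypothetical):

dvc add data/new_dataset.json                        # start tracking the file; creates data/new_dataset.json.dvc
git add data/new_dataset.json.dvc data/.gitignore    # commit the pointer file and updated .gitignore, not the data
git commit -m "Track new_dataset.json with DVC"
dvc push                                             # upload the data itself to the Google Drive remote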

Sample Workflow

tl;dr: