Previously, the raw DCC texts were preprocessed in the emc-dcc-preprocessing repo using a spaCy pipeline. Both that pipeline and the preprocessed texts then had to be loaded in this repo.
I have now written a separate script, build-pipeline.py, that builds the whole pipeline, from tokenisation up to and including the context algorithm. The pipeline can therefore be run directly on the raw text.
I also rewrote the context.ipynb notebook to use this new pipeline. This should not affect the actual predictions, and I confirmed that the performance scores did not change.
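To illustrate the idea of a single pipeline that goes from raw text to context annotations, here is a minimal sketch in plain Python. It is not the contents of build-pipeline.py: the real script builds a spaCy pipeline, whereas every name below (`Doc`, `tokenise`, `mark_negation`, `build_pipeline`) and the toy negation rule are hypothetical stand-ins chosen for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical container for a text and its annotations."""
    text: str
    tokens: List[str] = field(default_factory=list)
    negated: List[bool] = field(default_factory=list)


Step = Callable[[Doc], Doc]

# Toy trigger list, an assumption for this sketch only.
NEGATION_TRIGGERS = {"no", "not", "without"}


def tokenise(doc: Doc) -> Doc:
    # Split into word tokens and single punctuation characters.
    doc.tokens = re.findall(r"\w+|[^\w\s]", doc.text)
    return doc


def mark_negation(doc: Doc, window: int = 3) -> Doc:
    # Mark up to `window` word tokens after a trigger as negated;
    # punctuation ends the scope of the trigger.
    negated = [False] * len(doc.tokens)
    remaining = 0
    for i, tok in enumerate(doc.tokens):
        if tok.lower() in NEGATION_TRIGGERS:
            remaining = window
        elif not tok[0].isalnum():
            remaining = 0
        elif remaining > 0:
            negated[i] = True
            remaining -= 1
    doc.negated = negated
    return doc


def build_pipeline(steps: List[Step]) -> Callable[[str], Doc]:
    # Chain the steps so the result can be called on raw text.
    def run(text: str) -> Doc:
        doc = Doc(text)
        for step in steps:
            doc = step(doc)
        return doc
    return run


pipeline = build_pipeline([tokenise, mark_negation])
doc = pipeline("The patient has no fever, but reports a headache.")
negated_tokens = [t for t, n in zip(doc.tokens, doc.negated) if n]
print(negated_tokens)
```

The point of the design is that callers only ever pass raw text to `pipeline`; no separately preprocessed texts need to be loaded.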
Closes #5.