kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Grobid with DL models natively on MacOS ARM #1108

Closed Schroedi closed 3 weeks ago

Schroedi commented 5 months ago

This is my attempt to use grobid on MacOS ARM. The docs state that MacOS is not fully supported, so feel free to mark this issue as out of scope.

If anybody got it working, I would be interested in the package versions used.

Here I document what I tried and how far I got:

System

MacOS 14.4.1 (ARM M3)

java --version

openjdk 17.0.10 2024-01-16 LTS
OpenJDK Runtime Environment Zulu17.48+15-CA (build 17.0.10+7-LTS)
OpenJDK 64-Bit Server VM Zulu17.48+15-CA (build 17.0.10+7-LTS, mixed mode, sharing)

Steps

# clone grobid and enter the checkout
git clone https://github.com/kermitt2/grobid
cd grobid

# shared venv
uv venv -p 3.9
source .venv/bin/activate
uv pip install jep==4.2.0
cp .venv/lib/python3.9/site-packages/jep/jep.cpython-39-darwin.so grobid-home/lib/mac_arm-64/libjep.dylib

# prepare delft
# if I remember correctly, 0.3.3 is the version used in the container
git clone --branch v0.3.3 https://github.com/kermitt2/delft
cd delft
# change requirements until delft works - very scientific
wget -O delftMacArm.patch http://sprunge.us/iFQCZx
git apply delftMacArm.patch
uv pip install -r requirements.txt
python setup.py build install
# test delft
# python delft/applications/grobidTagger.py date tag --architecture BidLSTM_CRF
# enjoy json output :)
cd ..

# build grobid
./gradlew clean install

# the patch edits grobid-home/config/grobid.yaml
# 1. change delft: install: "../delft" to: delft: install: "delft"
# 2. use delft models
wget -O grobidConf.patch http://sprunge.us/o8IbpR
git apply grobidConf.patch

# I had to include the path to the libpython from the venv here
java -Xmx4G -Djava.library.path=grobid-home/lib/mac_arm-64:/opt/homebrew/opt/python@3.9/Frameworks/Python.framework/Versions/3.9/lib -jar grobid-core/build/libs/grobid-core-0.8.1-SNAPSHOT-onejar.jar -gH grobid-home -dIn /Users/ascadian/Projects/paperSegmentation/train_data/raw -dOut /Users/ascadian/Projects/paperSegmentation/train_data/anno_raw_test  -exe createTraining
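
The hard-coded `-Djava.library.path` value above can also be assembled from variables, which makes it easier to see which directories the JVM actually receives. This is a minimal sketch in POSIX sh, not something GROBID provides: `python_libdir` and `join_library_path` are hypothetical helper names, and the GROBID native directory is copied from the command above. The only library call used is Python's standard `sysconfig.get_config_var("LIBDIR")`, which reports where the interpreter's shared library lives on Unix-like builds.

```shell
#!/bin/sh
# Ask a Python interpreter where its shared library directory is.
# sysconfig is part of the standard library; the call may print "None"
# on platforms where LIBDIR is not defined.
python_libdir() {
    "$1" -c 'import sysconfig; print(sysconfig.get_config_var("LIBDIR"))'
}

# Hypothetical helper: join all non-empty arguments with ':' so the
# result can be passed to -Djava.library.path.
join_library_path() {
    result=""
    for dir in "$@"; do
        [ -n "$dir" ] || continue
        if [ -z "$result" ]; then
            result="$dir"
        else
            result="$result:$dir"
        fi
    done
    printf '%s\n' "$result"
}

# Directory taken from the command in this issue; adjust to your checkout.
GROBID_NATIVE_DIR="grobid-home/lib/mac_arm-64"

# Use whichever python3 is first on PATH (the active venv, if one is activated).
if command -v python3 >/dev/null 2>&1; then
    PYTHON_LIB_DIR="$(python_libdir python3)"
else
    PYTHON_LIB_DIR=""
fi

echo "-Djava.library.path=$(join_library_path "$GROBID_NATIVE_DIR" "$PYTHON_LIB_DIR")"
```

Running this inside the activated venv prints the option to paste into the `java` command. Whether the reported directory is the one JEP needs on a given machine is not guaranteed, so treat the output as a starting point.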

Output/Error

22:38:05.157 [main] INFO  org.grobid.core.main.GrobidHomeFinder - No Grobid property was provided. Attempting to find Grobid home in the current directory...
22:38:05.161 [main] INFO  org.grobid.core.main.GrobidHomeFinder - *** USING GROBID HOME: /Users/ascadian/Projects/grobid3/grobid-home
22:38:05.163 [main] INFO  org.grobid.core.main.GrobidHomeFinder - No Grobid property was provided. Attempting to find Grobid home in the current directory...
22:38:05.163 [main] INFO  org.grobid.core.main.GrobidHomeFinder - *** USING GROBID HOME: /Users/ascadian/Projects/grobid3/grobid-home
22:38:05.163 [main] INFO  org.grobid.core.main.GrobidHomeFinder - Grobid config file location was not explicitly set via 'org.grobid.config' system variable, defaulting to: /Users/ascadian/Projects/grobid3/grobid-home/config/grobid.yaml
22:38:05.280 [main] INFO  org.grobid.core.main.LibraryLoader - Loading external native sequence labelling library
22:38:05.286 [main] INFO  org.grobid.core.main.LibraryLoader - Loading Wapiti native library...
22:38:05.489 [main] INFO  org.grobid.core.main.LibraryLoader - Loading JEP native library for DeLFT... /Users/ascadian/Projects/grobid3/grobid-home/lib/mac_arm-64
22:38:05.640 [main] INFO  org.grobid.core.main.LibraryLoader - Native library for sequence labelling loaded
22:38:05.642 [main] INFO  org.grobid.core.lexicon.Lexicon - Initiating dictionary
22:38:05.642 [main] INFO  org.grobid.core.lexicon.Lexicon - End of Initialization of dictionary
22:38:05.642 [main] INFO  org.grobid.core.lexicon.Lexicon - Initiating names
22:38:05.642 [main] INFO  org.grobid.core.lexicon.Lexicon - End of initialization of names
22:38:05.885 [main] INFO  org.grobid.core.lexicon.Lexicon - Initiating country codes
22:38:05.888 [main] INFO  org.grobid.core.lexicon.Lexicon - End of initialization of country codes
.DS_Store
2004.03577.pdf
NeurIPS-2023-modelling-cellular-perturbations-with-the-sparse-additive-mechanism-shift-variational-autoencoder-Paper-Conference.pdf
s41586-024-07303-5.pdf
fpsyg-07-00789.pdf
4 files to be processed.
/Users/ascadian/Projects/paperSegmentation/train_data/raw/2004.03577.pdf
[Wapiti] Loading model: "/Users/ascadian/Projects/grobid3/grobid-home/models/fulltext/model.wapiti"
Model path: /Users/ascadian/Projects/grobid3/grobid-home/models/fulltext/model.wapiti
[Wapiti] Loading model: "/Users/ascadian/Projects/grobid3/grobid-home/models/segmentation/model.wapiti"
Model path: /Users/ascadian/Projects/grobid3/grobid-home/models/segmentation/model.wapiti
22:38:09.792 [main] INFO  org.grobid.core.jni.DeLFTModel - Loading DeLFT model for reference-segmenter with architecture BidLSTM_ChainCRF_FEATURES...
22:38:09.794 [pool-1-thread-1] INFO  org.grobid.core.jni.JEPThreadPool - Creating JEP instance for thread 19
WARNING: Failed to get and cache frequent class types!
WARNING: Failed to get and cache primitive class types!
22:38:09.846 [pool-1-thread-1] ERROR org.grobid.core.jni.JEPThreadPool - JEP initialisation failed
22:38:09.878 [pool-1-thread-1] INFO  org.grobid.core.jni.JEPThreadPool - Creating JEP instance for thread 19
WARNING: Failed to get and cache frequent class types!
WARNING: Failed to get and cache primitive class types!
22:38:09.879 [pool-1-thread-1] ERROR org.grobid.core.jni.JEPThreadPool - JEP initialisation failed
22:38:09.884 [main] ERROR org.grobid.core.jni.DeLFTModel - DeLFT model reference_segmenter labelling failed
java.util.concurrent.ExecutionException: java.lang.RuntimeException: JEP initialisation failed
    at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at org.grobid.core.jni.JEPThreadPool.call(JEPThreadPool.java:176)
    at org.grobid.core.jni.DeLFTModel.label(DeLFTModel.java:194)
    at org.grobid.core.engines.tagging.DeLFTTagger.label(DeLFTTagger.java:29)
    at org.grobid.core.engines.AbstractParser.label(AbstractParser.java:47)
    at org.grobid.core.engines.ReferenceSegmenterParser.createTrainingData(ReferenceSegmenterParser.java:334)
    at org.grobid.core.engines.FullTextParser.createTraining(FullTextParser.java:1153)
    at org.grobid.core.engines.Engine.createTraining(Engine.java:551)
    at org.grobid.core.engines.Engine.batchCreateTraining(Engine.java:655)
    at org.grobid.core.engines.ProcessEngine.createTraining(ProcessEngine.java:376)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:344)
    at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:194)
Caused by: java.lang.RuntimeException: JEP initialisation failed
    at org.grobid.core.jni.JEPThreadPool.createJEPInstance(JEPThreadPool.java:135)
    at org.grobid.core.jni.JEPThreadPool.getJEPInstance(JEPThreadPool.java:151)
    at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:119)
    at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:84)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
22:38:09.886 [main] ERROR org.grobid.core.engines.Engine - An error occured while processing the following pdf: /Users/ascadian/Projects/paperSegmentation/train_data/raw/2004.03577.pdf
org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid training data generation for full text.
    at org.grobid.core.engines.FullTextParser.createTraining(FullTextParser.java:1562)
    at org.grobid.core.engines.Engine.createTraining(Engine.java:551)
    at org.grobid.core.engines.Engine.batchCreateTraining(Engine.java:655)
    at org.grobid.core.engines.ProcessEngine.createTraining(ProcessEngine.java:376)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:344)
    at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:194)
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.commons.lang3.tuple.Pair.getLeft()" because "result" is null
    at org.grobid.core.engines.FullTextParser.createTraining(FullTextParser.java:1154)
    ... 9 common frames omitted

Used patches (in case the pastebin is unavailable)

grobidConf.patch delftMacArm.patch

lfoppiano commented 5 months ago

For TensorFlow on Apple ARM, you should install tensorflow-deps using conda (see https://github.com/lfoppiano/material-parsers?tab=readme-ov-file#set-up-on-apple-m1; you can stop before the spaCy model download, same scientific approach 😄).

I usually use Conda and install most packages with pip, unless they are particularly annoying (e.g. they try to compile, fail, etc.).

The JEP library should not need to be copied under grobid-home, because the version in the Python environment should be used directly. To do that, export the equivalent of CONDA_PREFIX for your venv (pointing to the venv directory) before running grobid.
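
A quick way to check what such a prefix variable would point to, and whether jep is installed under it, is sketched below. It assumes the usual Unix layout `<prefix>/lib/python*/site-packages/jep` and the variables `CONDA_PREFIX` (set by `conda activate`) and `VIRTUAL_ENV` (set by a venv's activate script). `active_env_prefix` and `jep_dir_in_env` are hypothetical helper names. How GROBID itself consumes the variable is not shown here and is taken on trust from the comment above.

```shell
#!/bin/sh
# Hypothetical helper: print the root of the active Python environment,
# preferring CONDA_PREFIX and falling back to VIRTUAL_ENV.
active_env_prefix() {
    if [ -n "${CONDA_PREFIX:-}" ]; then
        printf '%s\n' "$CONDA_PREFIX"
    elif [ -n "${VIRTUAL_ENV:-}" ]; then
        printf '%s\n' "$VIRTUAL_ENV"
    else
        return 1
    fi
}

# Hypothetical helper: print the first jep package directory found under
# a prefix, assuming the layout <prefix>/lib/python*/site-packages/jep.
jep_dir_in_env() {
    env_root="$1"
    for dir in "$env_root"/lib/python*/site-packages/jep; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

if env_prefix="$(active_env_prefix)"; then
    jep_dir_in_env "$env_prefix" || echo "jep not found under $env_prefix" >&2
else
    echo "no active conda env or venv detected" >&2
fi
```

If the script reports that jep is missing, installing it into the active environment comes before any path tweaking.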

Schroedi commented 4 months ago

Thanks for taking the time. I tried using conda but was not successful; the situation is the same as before. I was not completely sure what you meant about CONDA_PREFIX. It points to the venv's root directory. Adding that to the java.library.path does not make any difference for me, and conda already exports it.

I will continue to use my remote Linux machine for now, so feel free to close this issue.

Just for the record, here is what I did:

Error:

[...]
14:34:45.795 [main] INFO  org.grobid.core.main.LibraryLoader - Loading JEP native library for DeLFT... /Users/ascadian/Projects/grobid4/grobid-home/lib/mac_arm-64
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:344)
    at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:194)
Caused by: java.lang.UnsatisfiedLinkError: no jep in java.library.path: grobid-home/lib/mac_arm-64:/opt/homebrew/Caskroom/miniforge/base/envs/grobidEnv2/lib/
    at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2434)
    at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:818)
    at java.base/java.lang.System.loadLibrary(System.java:1993)
    at org.grobid.core.main.LibraryLoader.load(LibraryLoader.java:158)
    at org.grobid.core.factory.AbstractEngineFactory.init(AbstractEngineFactory.java:72)
    at org.grobid.core.factory.GrobidFactory.<init>(GrobidFactory.java:19)
    at org.grobid.core.factory.GrobidFactory.newInstance(GrobidFactory.java:73)
    at org.grobid.core.factory.GrobidFactory.getInstance(GrobidFactory.java:30)
    at org.grobid.core.engines.ProcessEngine.getEngine(ProcessEngine.java:46)
    at org.grobid.core.engines.ProcessEngine.createTraining(ProcessEngine.java:376)
    ... 6 more

Adding the jep dir back to the java.library.path, or copying the lib as before, results in the same situation as in the original comment (JEP initialisation failed).
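
The `UnsatisfiedLinkError` above means that no directory on `java.library.path` contained a file the JVM maps the name `jep` to. `System.mapLibraryName("jep")` gives `libjep.dylib` on macOS and `libjep.so` on Linux. The sketch below walks a colon-separated path and prints any matching file; `find_native_lib` is a hypothetical helper, and the example path is the one from the error message.

```shell
#!/bin/sh
# Hypothetical helper: for each directory on a colon-separated path, print
# any file named lib<name>.dylib or lib<name>.so. No output means the JVM
# would not find the library on that path either.
find_native_lib() {
    name="$1"
    search_path="$2"
    old_ifs="$IFS"
    IFS=":"
    # Unquoted on purpose: split the path on ':' (POSIX sh / bash behaviour).
    for dir in $search_path; do
        for candidate in "lib$name.dylib" "lib$name.so"; do
            if [ -f "$dir/$candidate" ]; then
                printf '%s\n' "$dir/$candidate"
            fi
        done
    done
    IFS="$old_ifs"
}

# Path copied from the UnsatisfiedLinkError in this comment.
find_native_lib jep "grobid-home/lib/mac_arm-64:/opt/homebrew/Caskroom/miniforge/base/envs/grobidEnv2/lib/"
```

A pip-installed jep normally lives under `site-packages/jep` rather than directly in the environment's `lib/`, so an empty result for the path above would be consistent with the error. A non-empty result only shows that the file exists, not that it loads or that JEP can then start the interpreter.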

lfoppiano commented 4 months ago

About the CONDA_PREFIX I was not completely sure what you meant. It points to the venv's root directory. Adding that to the java.library.path does not make any difference for me. And conda already exports it.

You need to export it before running the gradle command. That should be enough. However, did you install jep in your conda environment? What does your pip list show?

It looks like it's searching in the wrong directory:

14:34:45.795 [main] INFO  org.grobid.core.main.LibraryLoader - Loading JEP native library for DeLFT... /Users/ascadian/Projects/grobid4/grobid-home/lib/mac_arm-64

It should load jep from your conda environment instead.

lfoppiano commented 3 weeks ago

@Schroedi I'm closing this. If you still have problems, feel free to reopen or comment.