import talon
from talon import quotations
from talon import signature
from talon.signature.bruteforce import extract_signature
import os
import resource

# Both environment variables are expressed in MB
soft = os.getenv('PYTHON_RLIMIT_DATA_SOFT', default='256')
hard = os.getenv('PYTHON_RLIMIT_DATA_HARD', default='512')
resource.setrlimit(resource.RLIMIT_DATA, (int(soft) * 1048576, int(hard) * 1048576))

# don't forget to init the library first
# it loads machine learning classifiers
talon.init()


def splitEmail(message):
    # Called from Java as split_email.splitEmail(arg), e.g. with the message
    # "Reply\n-----Original Message-----\nQuote"
    return quotations.split_emails(message)
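For clarity, the limit arithmetic in the script can be sketched on its own. The helper names `mb_to_bytes` and `data_limits_from_env` are made up for illustration and are not part of split_email.py; the sketch only computes the values and does not call `setrlimit`.

```python
import os

BYTES_PER_MB = 1024 * 1024  # 1048576, the multiplier used in the script


def mb_to_bytes(megabytes):
    """Convert a size in MB (int or numeric string) to bytes."""
    return int(megabytes) * BYTES_PER_MB


def data_limits_from_env(environ=os.environ):
    """Return the (soft, hard) RLIMIT_DATA values in bytes, using the script's defaults."""
    soft = environ.get('PYTHON_RLIMIT_DATA_SOFT', '256')
    hard = environ.get('PYTHON_RLIMIT_DATA_HARD', '512')
    return mb_to_bytes(soft), mb_to_bytes(hard)
```

With neither variable set, this gives a soft limit of 268435456 bytes and a hard limit of 536870912 bytes.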
And how I was calling this:
/**
 * A thread local instance of the Jep library. This is required to be thread local
 * as <a href="https://github.com/mrj0/jep/wiki/Performance-Considerations">
 * Jep will only execute calls on the thread it was instantiated on</a>
 * and <a href="https://github.com/mrj0/jep/issues/28">closing the Jep instance breaks the Numpy Python Library.</a>
 * Because of these two issues all Worker threads will call to a separate Jep thread.
 */
private static final ThreadLocal<Jep> threadLocal = new ThreadLocal<Jep>() {
    @Override
    protected Jep initialValue() {
        try {
            return new Jep(false, null, null, new ClassEnquirerImpl());
        } catch (JepException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void remove() {
        Jep jep = this.get();
        if (jep != null) {
            try {
                jep.close();
            } catch (JepException ex) {
                throw new RuntimeException(ex);
            }
        }
        super.remove();
    }
};
public static List<Integer> splitEmail(String message) throws JepException {
    List<Integer> emailStartLineNumbers = new ArrayList<>();
    Jep jep = threadLocal.get();
    if (jep == null) {
        return emailStartLineNumbers;
    }
    jep.eval("import split_email");
    jep.set("arg", message);
    jep.eval("x = split_email.splitEmail(arg)");
    Object lineMarkers = jep.getValue("x");
    jep.eval("del x");
    jep.eval("del arg");
    if (lineMarkers instanceof String) {
        char[] markers = ((String) lineMarkers).toCharArray();
        int size = markers.length;
        for (int i = 0; i < size; i++) {
            if (markers[i] == 's') {
                emailStartLineNumbers.add(i);
            }
        }
    } else {
        throw new RuntimeException("Unexpected return type from Python when separating email messages.");
    }
    return emailStartLineNumbers;
}
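As a minimal sketch (not part of the actual worker code), the marker handling in the Java method above can be expressed in Python. It assumes, as the Java code does, that the result is a string with one marker character per line and that 's' marks a line where an email starts. The function name `email_start_line_numbers` is hypothetical.

```python
def email_start_line_numbers(markers):
    """Return the zero-based indices of the lines marked 's' in the marker string."""
    if not isinstance(markers, str):
        # Mirrors the RuntimeException thrown by the Java code for non-string results
        raise TypeError("Unexpected return type from Python when separating email messages.")
    return [index for index, marker in enumerate(markers) if marker == 's']
```

For example, `email_start_line_numbers("tsem")` returns `[1]`, because only the second character is 's'.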
This all worked fine. However, I am now in the process of updating things to use Python 3.10 (openSUSE Tumbleweed), along with Jep 4.0.3. My updated dependencies look like this:
# Pinning version of scikit-learn to 1.0.1 to avoid this error: "Trying to unpickle estimator LinearSVC from version 1.0.1 when using version 1.1.2."
RUN zypper -n refresh && \
    zypper -n update && \
    zypper -n install python3-devel && \
    zypper -n install python3-pip && \
    zypper -n install python3-matplotlib && \
    zypper -n install zlib-devel && \
    zypper -n install python3-numpy-devel && \
    zypper -n install python3-lxml && \
    zypper -n install python3-scipy && \
    zypper -n install gcc-c++ && \
    pip install scikit-learn==1.0.1 && \
    pip install regex==2022.6.2 && \
    pip install -U https://github.com/mailgun/talon/archive/refs/tags/v1.6.0.zip && \
    pip install jep==4.0.3 && \
    zypper -n clean --all
ENV LD_PRELOAD=/usr/lib64/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so
ENV LD_LIBRARY_PATH=/usr/lib/python3.10/site-packages/talon:/usr/lib64/python3.10/site-packages/jep
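Since the scikit-learn pin exists to avoid an unpickling version mismatch, a quick check inside the container can confirm that the pinned versions are the ones actually installed. This is only a sketch: the `PINNED` dictionary and the helper names are made up for illustration, and the version numbers are copied from the pip commands above.

```python
from importlib import metadata

# Pins copied from the pip install commands in the Dockerfile
PINNED = {"scikit-learn": "1.0.1", "regex": "2022.6.2", "jep": "4.0.3"}


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_mismatches(pins=PINNED):
    """Return {package: (expected, installed)} for every pin that is not satisfied."""
    mismatches = {}
    for package, expected in pins.items():
        found = installed_version(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `pin_mismatches()` in the image should return an empty dictionary if every pinned package is installed at the expected version.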
Because Jep is now an abstract class in 4.0.3, I also updated my Java code to use SubInterpreter rather than Jep:
private static final ThreadLocal<Jep> threadLocal = new ThreadLocal<Jep>() {
    @Override
    protected Jep initialValue() {
        try {
            final JepConfig jepConfig = new JepConfig();
            jepConfig.setIncludePath(null);
            jepConfig.setClassLoader(null);
            jepConfig.setClassEnquirer(new ClassEnquirerImpl());
            return new SubInterpreter(jepConfig);
        } catch (JepException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void remove() {
        Jep jep = this.get();
        if (jep != null) {
            try {
                jep.close();
            } catch (JepException ex) {
                throw new RuntimeException(ex);
            }
        }
        super.remove();
    }
};
However, when I run my Java code now, the JVM crashes with the following error. This appears to happen after split_email.py has been called:
12:45:31.452 worker-markup-fs> INFO [2022-10-11 11:45:31,452] com.github.cafdataprocessing.worker.markup.core.EmailSplitter: Starting email splitting based on document received
12:45:31.960 worker-markup-fs> /usr/lib/python3.10/site-packages/talon/signature/extraction.py:7: UserWarning: NumPy was imported from a Python sub-interpreter but NumPy does not properly support sub-interpreters. This will likely work for most users but might cause hard to track down issues or subtle bugs. A common user of the rare sub-interpreter feature is wsgi which also allows single-interpreter mode.
12:45:31.960 worker-markup-fs> Improvements in the case of bugs are welcome, but is not on the NumPy roadmap, and full support may require significant effort to achieve.
12:45:31.960 worker-markup-fs> import numpy
12:45:32.950 worker-markup-fs> INFO [2022-10-11 11:45:32,949] com.github.cafdataprocessing.worker.markup.core.EmailSplitter: Email Splitting completed
12:45:32.950 worker-markup-fs> INFO [2022-10-11 11:45:32,949] com.github.cafdataprocessing.worker.markup.core.MarkupHeadersAndBody: Starting markup of Headers and Body
12:45:32.956 worker-markup-fs> OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f6a94e9f000, 16384, 0) failed; error='Not enough space' (errno=12)
12:45:32.956 worker-markup-fs> [31.713s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
12:45:32.956 worker-markup-fs> #
12:45:32.956 worker-markup-fs> # There is insufficient memory for the Java Runtime Environment to continue.
12:45:32.956 worker-markup-fs> # Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
Is there anything I'm doing wrong that could be causing this?
I did try using SharedInterpreter instead of SubInterpreter but saw the same error. Going by this: https://github.com/ninia/jep/wiki/SharedInterpreter-vs-SubInterpreter#which-should-i-use
I think SubInterpreter is what I should be using, given how I used Jep instances previously.
I did note this warning in the log as well, but I'm not sure whether it's related to the JVM crash:
/usr/lib/python3.10/site-packages/talon/signature/extraction.py:7: UserWarning: NumPy was imported from a Python sub-interpreter but NumPy does not properly support sub-interpreters. This will likely work for most users but might cause hard to track down issues or subtle bugs. A common user of the rare sub-interpreter feature is wsgi which also allows single-interpreter mode.
For context, I was previously using Jep 3.8.2 and its dependencies without issues on openSUSE Leap 15.4, which runs Python 2.7. The Python script at the top of this post is split_email.py, an example of the script I am calling from Java.
Many thanks