Closed gilbertfl closed 7 months ago
Can you please post the exact error and stacktrace you get? Without more info, I'm not able to help you.
Parallel execution is entirely handled by Rayon, a data-parallelism library for Rust. It probably makes sense for you to look into its repository and documentation.
I'm sorry, I won't be able to help with any Rust. I'm using the latest Python library (lingua-language-detector 2.0.0) and Python 3.9.18.
I tried executing the routine directly from the command line, and the only thing I get in the console is "Killed". It's not a lot more helpful than the "kernel crash" error I get with Jupyter. Still, I counted 8 rounds through the loop, which is better than the 3-4 I got with Jupyter.
Is there some way to have the library generate a trace file?
Have you tried the single-threaded `compute_language_confidence_values`? Does this one work well on your dataset?

> Is there some way to have the library generate a trace file?

I don't think so. Perhaps you can wrap your Python code in a `try...except` block to log the error somehow.
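Something along these lines, for instance (just an untested sketch; `texts` stands for however you load your lines):

```python
# Rough sketch: run the single-threaded method line by line and log
# which sample, if any, blows up.
import logging
from lingua import LanguageDetectorBuilder

logging.basicConfig(filename="lingua_errors.log", level=logging.ERROR)
detector = LanguageDetectorBuilder.from_all_languages().build()

texts = ["..."]  # replace with your 8.37M lines

for i, text in enumerate(texts):
    try:
        values = detector.compute_language_confidence_values(text)
    except Exception:
        logging.exception("Failed on sample %d: %r", i, text)
```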
I did not try the single-threaded method on the complete dataset, but it works if I select random lines. Looping on 8M lines seems like a waste when I have 12 cores available.
There is no custom pool: I do not use Python threading for this program, nor any other threading-enabled library (like Dask). The only special thing is that my Pandas comes from the Intel oneAPI repo, so it may have some threading integrated.
My CPU is a 6-core, 12-thread Intel Core i7-10710U mobile CPU.
Can you share your dataset with me? Without any stacktrace, I don't know how to reproduce your problem otherwise. I could feed your data into a Rust program running Lingua to debug and get more details about the error.
It's not my call to share the dataset with you, but the original researchers might be willing to do it. I would have to share some prep code with you so you can start at the same point as I am.
Here is the research page: http://mib.projects.iit.cnr.it/dataset.html I am processing their "genuine accounts" group.
Sounds good. I'm definitely willing to help but, as I said, I need more information and / or data.
I have created a GitHub repo with the data prep and minimal reproduction code and shared it with you. Let me know when you get the data.
I did a dry run, and I think processing the data is not the problem by itself.
At first I tried reducing the chunk size, which only changed the point where it crashes: after 475k lines with 25k chunks, after 520k lines with 10k chunks, etc.
I commented out the `results.append(x[:, 0])` line and reduced the chunk size to 1000. In this configuration, the loop went over the whole 8.37M lines without error. As soon as I put that line back in, I get a crash right after 483k lines.
I see a few (surprising) possibilities: something in what you return does not sit well with NumPy, NumPy slicing crashes, or building a list of NumPy arrays makes Python crash for some reason. The last possibility is far-fetched, since with 25k chunks the list only has 19 items when the crash happens.
I'm thinking out loud here.
The returned `ConfidenceValue`s end up stored with NumPy's `void` dtype, which is not ideal, but it's there. So I'm slicing big 2D `void` arrays, when it could be a 3D array of (sample, language, probability). Is there some interface to convert a `ConfidenceValue` to a tuple `(string, float)`?
Last thing: what I'm doing by slicing is taking the best guess with its probability, for each sample. All the other classes are not really interesting to me. What I would suggest for a future version is a `compute_top_n_language_confidence_values` kind of function, as a generalized replacement for my loop.
Interesting. I was actually wondering already whether NumPy could be the problem here. Anyway, it does not make much sense to store entire `ConfidenceValue`s in NumPy arrays. They are meant for plain numerical data, so you should convert the outcome of Lingua appropriately before storing it in the arrays.
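For example (a quick sketch, not tested on your data), keep only the top guess as plain strings and floats:

```python
import numpy as np
from lingua import LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_all_languages().build()

def top_guesses(texts):
    """Return the best-guess language name and its probability per text."""
    all_values = detector.compute_language_confidence_values_in_parallel(texts)
    # Each inner list is sorted by confidence in descending order,
    # so the first entry is the top guess for that sample.
    langs = np.array([values[0].language.name for values in all_values])
    probs = np.array([values[0].value for values in all_values], dtype=np.float64)
    return langs, probs
```

That way the arrays contain plain strings and floats instead of `ConfidenceValue` objects.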
> Is there some interface to convert a `ConfidenceValue` to a tuple `(string, float)`?
No, there isn't, but perhaps it's a good idea to add a method like `to_tuple()` to the `ConfidenceValue` struct in order to make working with NumPy and Pandas more pleasant. I will think about it.
I am using `compute_language_confidence_values_in_parallel` to process a few million tweet samples. For a smaller set (203k lines), it takes a little time but works well: all my cores go to 100% and the results are fine. For a bigger set, it crashes after about 400k lines, even if I process the data in 100k-line chunks (this is how I estimate the 400k limit).
Here is the outline of my code:
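(Simplified sketch: the file path, column name, and chunk size are placeholders, but the loop structure and the `results.append(x[:, 0])` line are what I actually run.)

```python
import numpy as np
import pandas as pd
from lingua import LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_all_languages().build()

df = pd.read_csv("tweets.csv")  # ~8.37M rows in the full dataset
CHUNK_SIZE = 100_000

results = []
for start in range(0, len(df), CHUNK_SIZE):
    chunk = df["text"].iloc[start:start + CHUNK_SIZE].tolist()
    # the crash happens on this call, roughly 400k lines in
    values = detector.compute_language_confidence_values_in_parallel(chunk)
    x = np.array(values)      # one row per sample, one column per language
    results.append(x[:, 0])   # keep only the best guess for each sample
```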
The crash happens on the `compute_language_confidence_values_in_parallel` line. Memory usage stays 1 GB under the system total, but before crashing I get a long spike of 100% disk activity. My first version of this function did no chunking and also crashed.

There seems to be something that is not purged between executions of `compute_language_confidence_values_in_parallel`. Should I do something between chunks?