pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

`compute_language_confidence_values_in_parallel` crashes with big dataset #191

Closed gilbertfl closed 7 months ago

gilbertfl commented 8 months ago

I am using compute_language_confidence_values_in_parallel to process a few million tweet samples.

For a smaller set (203k lines), it takes a little time but works well - all my cores max out at 100% and the results are fine.

For a bigger set it crashes after about 400k lines, even if I process it in 100k-line chunks (that is how I estimate the ~400k limit).

Here is the outline of my code:

import numpy
import pandas
from lingua import LanguageDetectorBuilder

def detecte_langue(tweet_series: pandas.Series) -> pandas.Series:
    detector = LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
    chunksize = 100000
    results = list()
    for thischunk in chunker(tweet_series, chunksize):
        # The crash happens on the next line once roughly 400k lines have been processed.
        x = numpy.array(detector.compute_language_confidence_values_in_parallel(thischunk))
        results.append(x[:,0])  # keep only the top confidence value per sample
    return pandas.concat([pandas.Series(r) for r in results])
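
The chunker helper is not shown in the report; continuing from the snippet above, a minimal sketch of such a helper, assuming it simply yields fixed-size slices of the Series as plain lists (the parallel method takes a list of strings), might look like this:

def chunker(series: pandas.Series, size: int):
    # Yield successive fixed-size slices of the Series as plain Python lists.
    for start in range(0, len(series), size):
        yield series.iloc[start:start + size].tolist()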

The crash happens on the compute_language_confidence_values_in_parallel line. Memory usage stays 1 GB below the system total, but just before the crash I see a long spike of 100% disk activity. My first version of this function did not do chunking and also crashed.

There seems to be something that is not purged between executions of compute_language_confidence_values_in_parallel. Should I do something between chunks?

pemistahl commented 8 months ago

Can you please post the exact error and stacktrace you get? Without more info, I'm not able to help you.

Parallel execution is entirely handled by Rayon, a data-parallelism library for Rust. It probably makes sense for you to look into its repository and documentation.

gilbertfl commented 8 months ago

I'm sorry, I won't be able to help with any Rust. I'm using the latest Python library (lingua-language-detector 2.0.0) and Python 3.9.18.

I tried executing the routine directly from the command line, and the only thing I get in the console is "Killed". That is not much more helpful than the "kernel crash" error I get with Jupyter. Still, I counted 8 rounds through the loop, which is better than the 3-4 I got in Jupyter.

Is there some way to have the library generate a trace file?

pemistahl commented 8 months ago
  1. Did you try the single-threaded method compute_language_confidence_values? Does this one work well on your dataset?
  2. Is there any other custom thread pool running on your machine while running Lingua?
  3. What is your operating system and your CPU? How many CPU cores do you have?

> Is there some way to have the library generate a trace file?

I don't think so. Perhaps you can wrap your Python code in a try...except to log the error somehow.
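
A minimal sketch of that kind of wrapping, reusing the names from the snippet above, could look like this. Note that a hard "Killed" from the operating system's OOM killer is a SIGKILL, which Python cannot catch, so the per-chunk log lines mainly help narrow down where the process dies:

import logging
import traceback

logging.basicConfig(filename="lingua_debug.log", level=logging.INFO)

for i, thischunk in enumerate(chunker(tweet_series, chunksize)):
    try:
        values = detector.compute_language_confidence_values_in_parallel(thischunk)
        logging.info("chunk %d processed (%d rows)", i, len(thischunk))
    except Exception:
        logging.error("chunk %d failed:\n%s", i, traceback.format_exc())
        raise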

gilbertfl commented 8 months ago

I did not try the single-threaded method on the complete dataset, but it works if I select random lines. Looping over 8M lines in a single thread seems like a waste when I have 12 cores available.

There is no custom pool: I do not use Python threading in this program, nor any other threading-enabled library (like Dask). The only special thing is that my pandas comes from the Intel oneAPI repo, so it may have some threading integrated.

My CPU is a 6-core, 12-thread Intel Core i7-10710U mobile CPU.

pemistahl commented 8 months ago

Can you share your dataset with me? Otherwise, without a stacktrace, I don't know how to reproduce your problem. I could feed your data into a Rust program running Lingua to debug it and get more details about the error.

gilbertfl commented 8 months ago

It is not my decision to share the dataset with you, but the original researchers might be willing to. I would also have to share some prep code with you so you can start from the same point as I do.

Here is the research page: http://mib.projects.iit.cnr.it/dataset.html. I am processing their "genuine accounts" group.

pemistahl commented 8 months ago

Sounds good. I'm definitely willing to help but, as I said, I need more information and / or data.

gilbertfl commented 8 months ago

I have created a GitHub repo with the data prep and minimal reproduction code and shared it with you. Let me know when you have the data.

gilbertfl commented 7 months ago

I did a dry run, and I think processing the data is not the problem in itself.

At first I tried reducing the chunk size, which only moved the point of failure: it crashed after 475k lines with 25k chunks, after 520k lines with 10k chunks, and so on.

I then commented out the results.append(x[:,0]) line and reduced the chunk size to 1000. In this configuration, the loop went over the whole 8.37M lines without error. As soon as I put that line back in, I got a crash right after 483k lines.

I see a few (surprising) possibilities: something in what you return does not sit well with NumPy, the NumPy slicing crashes, or building a list of NumPy arrays makes Python crash for some reason. The last possibility is far-fetched, since with 25k chunks the list has only 19 items when the crash happens.

gilbertfl commented 7 months ago

I'm thinking out loud here.

So I'm slicing big 2D void arrays, when it could be a 3D array of (sample, language, probability). Is there some interface to convert a ConfidenceValue to a tuple (string, float)?

Last thing: what I'm doing with the slicing is taking the best guess with its probability, for each sample. The other languages are not really interesting to me. My suggestion for a future version would be to add a compute_top_n_language_confidence_values kind of function as a generalized replacement for my loop.
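
As a workaround for taking only the best guess per sample, a sketch that avoids NumPy slicing entirely could rely on the fact that the per-sample lists returned by compute_language_confidence_values_in_parallel are sorted by descending confidence (variable names are illustrative):

top_languages = []
top_confidences = []
for values in detector.compute_language_confidence_values_in_parallel(thischunk):
    best = values[0]                          # highest-confidence entry comes first
    top_languages.append(best.language.name)  # e.g. "ENGLISH"
    top_confidences.append(best.value)        # probability as a plain float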

pemistahl commented 7 months ago

Interesting. I was actually already wondering whether NumPy could be the problem here. In any case, it does not make much sense to store entire ConfidenceValue objects in NumPy arrays. They are meant for plain numerical data, so you should convert the outcome of Lingua appropriately before storing it in the arrays.
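
A minimal sketch of that kind of conversion, again reusing thischunk and detector from the earlier snippet, keeps language names and confidences in separate, NumPy-friendly arrays instead of storing ConfidenceValue objects directly:

import numpy

raw = detector.compute_language_confidence_values_in_parallel(thischunk)
# One row per sample: a string array of language names and a float array of confidences.
languages = numpy.array([[cv.language.name for cv in sample] for sample in raw])
confidences = numpy.array([[cv.value for cv in sample] for sample in raw])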

> Is there some interface to convert a ConfidenceValue to a tuple (string, float)?

No, there isn't, but perhaps it's a good idea to add a method like to_tuple() to the ConfidenceValue struct in order to make working with NumPy and Pandas more pleasant. I will think about it.