
Remote code execution vulnerability in NLTK #3266

Closed Dunedan closed 2 months ago

Dunedan commented 3 months ago

The current and earlier versions of NLTK are vulnerable to remote code execution when using the integrated data package download functionality. A man-in-the-middle attacker, or an attacker with control over the NLTK data index, can force users who use data packages with pickled Python code to download a new version of the package that executes arbitrary code upon unpickling.

The data packages identified as vulnerable so far are "averaged_perceptron_tagger" and "punkt". For code to be vulnerable, it has to download a data package and use functionality in NLTK which causes the data package to be unpickled.

Here is an example of vulnerable code for the "averaged_perceptron_tagger" data package:

import nltk
nltk.download("averaged_perceptron_tagger")
nltk.pos_tag(["hello", "world"])

This vulnerability was reported, together with POC code to exploit it, multiple times to the NLTK team via the email address mentioned in SECURITY.md (and later directly to some maintainers as well). It was first reported on 2024-05-19. So far there has been no response from the NLTK maintainers to these reports.

https://github.com/nltk/nltk/issues/2522 already pointed out the security implications of using pickled code a few years ago, but didn't receive a response either.

ekaf commented 3 months ago

@Dunedan, what would you suggest?
Maybe print a warning and ask for confirmation before loading a pickle?

The pickles in question contain Python classes with executable functions. They are not data but programs. And, as a compiled format, they are not ideal for an open-source project.

But a would-be attacker still needs to trick the user into loading a malicious pickle.

Dunedan commented 3 months ago

The short answer is to stop downloading code over the network at runtime.

If that's not feasible for NLTK, I have no complete answer for how to solve this issue. However, there are some measures I'd implement which would at least improve the situation:

Remove use of pickled code wherever possible

While I haven't checked what the code in the data packages with pickled code does, I imagine it's not the code that's large, but rather the data associated with it. Maybe the algorithms implemented there can be moved into NLTK, removing the need for pickled code.

Also, instead of using pickle, a domain-specific format could be used (depending on the use case, something like ONNX (https://onnx.ai/) might fit).
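
For models that reduce to plain data, even JSON would do. A minimal sketch, assuming the model boils down to a weights dictionary, a tag dictionary, and a list of classes (the function names and layout here are illustrative, not NLTK's actual format):

import json

# Persist the tagger's plain-data components as JSON instead of a pickle,
# so nothing executable is ever shipped or loaded.
def save_model_json(weights, tagdict, classes, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"weights": weights, "tagdict": tagdict,
                   "classes": sorted(classes)}, f)

def load_model_json(path):
    with open(path, encoding="utf-8") as f:
        obj = json.load(f)
    return obj["weights"], obj["tagdict"], set(obj["classes"])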

Make pickled code reproducible

When an update for a data package with pickled code is issued, it's difficult to figure out what changed. Having the pickle files built reproducibly would help with assessing whether they're legit.

Make documentation clear for download once during setup

While the documentation already details various ways to install data packages, making it more prominent that data packages should be installed once during installation/setup, instead of calling Downloader.download() in application code, would help make users aware of the implications of downloading data packages in application code.
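
For example, a Dockerfile or setup script could run the downloader's command-line interface once at build time (the target directory and package list are illustrative):

python -m nltk.downloader -d /usr/local/share/nltk_data averaged_perceptron_tagger punkt

The application then only reads from that directory and never touches the network at runtime.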

Opt-in for auto-updating installed data packages

Right now, calling Downloader().download() always fetches the data index and downloads and installs potential updates to already installed packages. Instead, I believe the download functionality should by default only make network requests and install data packages if they're not already installed. Automatic updating could then be an additional parameter users have to set explicitly.
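
A minimal sketch of what callers can do today to get that behavior themselves (the helper name is mine, not NLTK's):

import nltk

def ensure_nltk_data(resource_path, package_id):
    # Only hit the network if the resource isn't installed yet;
    # nltk.data.find raises LookupError when it is absent.
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package_id)

ensure_nltk_data("corpora/words", "words")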

Opt-in for installing data packages containing code

Similar to having users explicitly opt-in for auto-updates, I'd do the same for downloads of data packages which contain code which is going to be executed.

In addition I'd always log a warning when a package with pickled code gets downloaded or used.

Verify downloaded packages

Right now the data index contains a checksum per package. However, this checksum is only used to check whether a downloaded package differs from a previously downloaded version; it is not used to check whether a downloaded package is the one referenced by the data index. This should be done, together with a move to a cryptographically strong hash function for the checksums.
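
A sketch of what that verification could look like, assuming the index ships a SHA-256 checksum per package (today's checksums are weaker; the names here are illustrative):

import hashlib

def verify_package(path, expected_sha256):
    # Hash the downloaded archive and refuse to install it if it doesn't
    # match what the data index advertises.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    if h.hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}")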

ekaf commented 3 months ago

Thanks @Dunedan for your detailed suggestions.

Should this issue be labelled critical? I wonder if it could make NLTK deemed unsafe for use in schools, or for inclusion in some software distributions. But on the other hand, has any exploit been reported since #2522, back in 2020? When nobody other than the administrator controls nltk_data, the pickles should be safe to load.

It would be nice to remove all the pickles completely, for example by moving the code into the classes of the corresponding corpus reader in NLTK, and reducing the data packages to, say, CSV files for tabular data and JSON or XML for complex data.

Dunedan commented 3 months ago

This is now being tracked as CVE-2024-39705.

ekaf commented 3 months ago

In addition to the two packages already mentioned, several others also contain pickles. In total, nltk_data contains 52 pickles, about half of which are Python 2 pickles; since Python 2 is not supported anymore, those may simply be deleted.

Starting at the easy end, the pickles in help/tagsets.zip are very easy to decompile, since they are just Python dictionaries that associate each tag symbol with a tuple consisting of the tag name and a string of instances. This is just tabular data that would fit very well in a CSV or TAB format. One may wonder what made it seem attractive to compile this data into pickles back then.
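
A sketch of that decompilation, assuming the pickle holds a dict mapping each tag to a (name, examples) tuple as described (the file names are illustrative):

import csv
import pickle

# One-time, offline conversion of a trusted local pickle into TSV.
with open("upenn_tagset.pickle", "rb") as f:
    tagset = pickle.load(f)  # inspected, locally controlled file

with open("upenn_tagset.tab", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for tag, (name, examples) in sorted(tagset.items()):
        writer.writerow([tag, name, examples])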

DhenPadilla commented 3 months ago

Hey team!!!

Thanks for raising this @Dunedan and for the insightful thread @ekaf!

We hit a critical dependabot alert regarding this vulnerability, so I'm now tracking this thread quite closely. I was trying to trace down its sub-dependencies, but got quite lost about where to look in order to verify whether these vulnerable data packages are being loaded into our codebase.

Sorry for the silly questions here, but I would love to understand this a little more:

A man-in-the-middle attacker or an attacker with control over the NLTK data index can force users which use data packages with pickled Python code to download a new version of the package which executes arbitrary code upon unpickling.

  1. This is the first time I've heard of pickling, so thank you for raising this and introducing the concept to me. My understanding of the vulnerability is that the pickled byte-stream can be manipulated before unpickling, and the manipulated content may include malicious executable code?

  2. We're currently using the LazyCorpusLoader and we download the words package. My understanding is that the download() function is the main vulnerable entry point here? To add fuel to the fire, we use this within an AWS Lambda function, which means we run a download on every single invocation, making this really open :/ What's the best workaround/fix here for us (if needed)?

  3. How can I tell whether the words package is a pickled download? Or, as a broader question: what's the best way to trace which datasets are prone to this vulnerability?

ekaf commented 3 months ago
#!/bin/sh
: > Nltk_Pickles.txt
for f in ~/nltk_data/*/*.zip
do
unzip -l "$f" | grep -i pickle >> Nltk_Pickles.txt
done

Nltk_Pickles.txt

mcepl commented 3 months ago

On Mon Jul 1, 2024 at 1:35 PM CEST, DhenPadilla wrote:

  1. This is the first time I've heard of pickling, so thank you for raising this and introducing the concept to me. My understanding of the vulnerability is that the pickled byte-stream can be manipulated before unpickling, and the manipulated content may include malicious executable code?

See https://docs.python.org/3/library/pickle.html and the big red warning there. A pickle can contain ANY Python data, including but not limited to executable functions. So pickle should NEVER be used for data that is outside of your control. In particular, downloading pickles from the Internet carries a serious risk of downloading and executing malicious code.
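
A minimal, harmless demonstration of that warning (it only runs echo, but an attacker's pickle could run anything):

import os
import pickle

class Payload:
    # __reduce__ lets an object tell pickle to call an arbitrary callable
    # on load; this is the standard pickle-exploitation primitive.
    def __reduce__(self):
        return (os.system, ("echo code executed during unpickling",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # runs the command; no NLTK involved at all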

mcepl commented 3 months ago

I understand this is just a brutal emergency-maintenance workaround, but would anything other than downloading stuff from the Internet break with this patch?

---
 nltk/app/chartparser_app.py    |   13 +++++++++++++
 nltk/corpus/reader/util.py     |    2 ++
 nltk/data.py                   |    2 ++
 nltk/parse/transitionparser.py |    2 ++
 nltk/tbl/demo.py               |    4 +++-
 5 files changed, 22 insertions(+), 1 deletion(-)

--- a/nltk/app/chartparser_app.py
+++ b/nltk/app/chartparser_app.py
@@ -800,6 +800,10 @@ class ChartComparer:
             showerror("Error Saving Chart", f"Unable to open file: {filename!r}\n{e}")

     def load_chart_dialog(self, *args):
+        showerror("Security Error",
+                  "Due to gh#nltk/nltk#3266, deserializing from " +
+                  "a pickle is forbidden.")
+        return
         filename = askopenfilename(
             filetypes=self.CHART_FILE_TYPES, defaultextension=".pickle"
         )
@@ -811,6 +815,8 @@ class ChartComparer:
             showerror("Error Loading Chart", f"Unable to open file: {filename!r}\n{e}")

     def load_chart(self, filename):
+        raise RuntimeError("Due to gh#nltk/nltk#3266, deserializing from " +
+                           "a pickle is forbidden.")
         with open(filename, "rb") as infile:
             chart = pickle.load(infile)
         name = os.path.basename(filename)
@@ -2268,6 +2274,10 @@ class ChartParserApp:
         if not filename:
             return
         try:
+            showerror("Security Error",
+                      "Due to gh#nltk/nltk#3266, deserializing from " +
+                      "a pickle is forbidden.")
+            return
             with open(filename, "rb") as infile:
                 chart = pickle.load(infile)
             self._chart = chart
@@ -2306,6 +2316,9 @@ class ChartParserApp:
             return
         try:
             if filename.endswith(".pickle"):
+                showerror("Due to gh#nltk/nltk#3266, deserializing from " +
+                          "a pickle is forbidden.")
+                return
                 with open(filename, "rb") as infile:
                     grammar = pickle.load(infile)
             else:
--- a/nltk/corpus/reader/util.py
+++ b/nltk/corpus/reader/util.py
@@ -521,6 +521,8 @@ class PickleCorpusView(StreamBackedCorpu

     def read_block(self, stream):
         result = []
+        raise RuntimeError("Due to gh#nltk/nltk#3266, deserializing from " +
+                           "a pickle is forbidden.")
         for i in range(self.BLOCK_SIZE):
             try:
                 result.append(pickle.load(stream))
--- a/nltk/data.py
+++ b/nltk/data.py
@@ -752,6 +752,8 @@ def load(
     if format == "raw":
         resource_val = opened_resource.read()
     elif format == "pickle":
+        raise RuntimeError("Due to gh#nltk/nltk#3266, deserializing from " +
+                           "a pickle is forbidden.")
         resource_val = pickle.load(opened_resource)
     elif format == "json":
         import json
--- a/nltk/parse/transitionparser.py
+++ b/nltk/parse/transitionparser.py
@@ -553,6 +553,8 @@ class TransitionParser(ParserI):
         """
         result = []
         # First load the model
+        raise RuntimeError("Due to gh#nltk/nltk#3266, deserializing from " +
+                           "a pickle is forbidden.")
         model = pickle.load(open(modelFile, "rb"))
         operation = Transition(self._algorithm)

--- a/nltk/tbl/demo.py
+++ b/nltk/tbl/demo.py
@@ -253,6 +253,8 @@ def postag(
                 )
             )
         with open(cache_baseline_tagger) as print_rules:
+            raise RuntimeError("Due to gh#nltk/nltk#3266, deserializing from " +
+                               "a pickle is forbidden.")
             baseline_tagger = pickle.load(print_rules)
             print(f"Reloaded pickled tagger from {cache_baseline_tagger}")
     else:
@@ -327,7 +329,7 @@ def postag(
         with open(serialize_output) as print_rules:
             brill_tagger_reloaded = pickle.load(print_rules)
         print(f"Reloaded pickled tagger from {serialize_output}")
-        taggedtest_reloaded = brill_tagger.tag_sents(testing_data)
+        taggedtest_reloaded = brill_tagger_reloaded.tag_sents(testing_data)
         if taggedtest == taggedtest_reloaded:
             print("Reloaded tagger tried on test set, results identical")
         else:

cassneal commented 3 months ago

Hello, is there an ETA for a fix?

ekaf commented 3 months ago

Before getting too alarmed, we may want to wait for a sober analysis of this vulnerability, bearing in mind that it has been known for several years, without any known exploitation yet.

Still, pickles are not open source, which is in itself a good reason to avoid them. In addition to repackaging the data, a complete fix would require deep modifications in several Python files:


#!/bin/sh

tree=nltk
q=pickl
of="${tree}_${q}.txt"
: > "$of"
mxd=$(ls -R "$tree" | grep / | awk -F "/" '{print NF}' | sort -nr | head -1)

for f in $(find -L "$tree" -maxdepth "$mxd" -name "*.py")
do
if grep -q "$q" "$f"
then
echo "$f" >> "$of"
echo -------------------------------- >> "$of"
grep -ni "$q" "$f" >> "$of"
printf '\n---------------------------------------------------------------------\n' >> "$of"
fi
done

nltk_pickl.txt

nicolaschaillan commented 3 months ago

We really need a fix for this. Lots of customers are freaking out when they see a High CVE finding... Any ETA?

nicolaschaillan commented 3 months ago

Before getting too alarmed, we may want to wait for a sober analysis of this vulnerability, bearing in mind that it has been known for several years, without any known exploitation yet.

Now that it has a public NIST CVE, people will exploit it. This isn't something you "wait" on. This is something you address right away...

nicolaschaillan commented 3 months ago
```diff
-        taggedtest_reloaded = brill_tagger.tag_sents(testing_data)
+        taggedtest_reloaded = brill_tagger_reloaded.tag_sents(testing_data)
         if taggedtest == taggedtest_reloaded:
             print("Reloaded tagger tried on test set, results identical")
         else:
```

Could you share the .patch file by any chance? I'm getting character issues when trying to apply it.

adivekar-utexas commented 3 months ago

+1, we need a fix for this... many repos on https://github.com/amazon-science/ use nltk and are affected.

Dunedan commented 3 months ago

A possible mitigation is to not use NLTK's downloader functionality to download vulnerable data packages until this issue is fixed. Instead, you could download the data packages your application needs manually once, verify that they are legit, and ship them bundled with your application.
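
For example, if the verified packages are vendored with the application, NLTK can be pointed at them via the NLTK_DATA environment variable, either set in the shell or from Python before nltk is imported (the path is illustrative):

import os

# NLTK reads NLTK_DATA when building its data search path, so set it first.
os.environ["NLTK_DATA"] = "/app/vendor/nltk_data"  # illustrative path

import nltk  # now resolves data from the vendored directory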

wallies commented 3 months ago

As @Dunedan mentioned above, a workaround for now is to make your own package index. The index currently used is https://github.com/nltk/nltk_data; you can create your own fork, validate the packages you need, and customise the mirror used via --url.
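
For example, pointing the downloader's CLI at a fork (the mirror URL is illustrative):

python -m nltk.downloader --url https://example.com/nltk_data/index.xml punkt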

ekaf commented 3 months ago

Now that it has a public NIST CVE, people will exploit it. This isn't something you "wait" on. This is something you address right away...

@nicolaschaillan, before even starting NLTK, you load Python, which already exposes the user to the risk of being tricked into downloading malicious pickles anyway. In that context, wouldn't a fix in NLTK only marginally improve the user's security?

nicolaschaillan commented 3 months ago

Now that it has a public NIST CVE, people will exploit it. This isn't something you "wait" on. This is something you address right away...

@nicolaschaillan, before even starting NLTK, you load Python, which already exposes the user to the risk of being tricked into downloading malicious pickles anyway. In that context, wouldn't a fix in NLTK only marginally improve the user's security?

This isn't a discussion about whether there are other ways to get into a system. Python has no such finding open at this time. This is a discussion about fixing this CVE. This feature should be DISABLED unless the administrator explicitly enables it, with a narrow whitelist of resources.

At the end of the day, while I understand this could seem like "expected behavior" and you can go fight that fight with NVD, you should have an option to enable this with a basic ENV variable (disabled by default) or something like that.

Not sure why your contributors are making all this more difficult than it has to be. A CVE is a big deal for enterprises. You AI people seem not to understand cyber 101.

adivekar-utexas commented 3 months ago

@ekaf I agree with @nicolaschaillan. Let me explain: I've gotten a high-severity ticket (both internally at Amazon and from GitHub's dependabot) because of this vulnerability.

Given the options of (a) not resolving this ticket and (b) removing all dependencies on nltk from my codebase, I am forced to choose (b), in order to comply with my organization's security requirements. It's either that or take down my repository.

ekaf commented 3 months ago

I inspected a typical pickle in each series, looking at their type() and, where present, their __dict__.

The tagsets and averaged_perceptron_tagger packages contain only simple data structures, and can easily be translated to a text-based data format, which solves the problem radically.

The rest of the pickles contain complex classes. A good translation may only be possible with help from the original package authors, and would require more time.

So at this moment, users who find themselves in an emergency, and need very strict settings, may need to consider some of the mitigations suggested above, like forbidding unpickling, or using their own secure fork.

nicolaschaillan commented 3 months ago

So at this moment, users who find themselves in an emergency, and need very strict settings, may need to consider some of the mitigations suggested above, like forbidding unpickling, or using their own secure fork.

The solution is simple: whitelist which pickles can be loaded. Give control to the tenant to decide which ones are fine or not, maybe by name or something. Right now, the fear is that a third party can manage to load some that are malicious.

mcepl commented 3 months ago

The solution is simple: whitelist which pickles can be loaded. Give control to the tenant to decide which ones are fine or not, maybe by name or something. Right now, the fear is that a third party can manage to load some that are malicious.

I don't think that's good enough. As long as there is any way to make NLTK download a pickle from the Internet (for example by using internal undocumented functions; there are no truly private, inaccessible functions in Python), somebody can make a script download malicious content from the Internet.

Pickle should really never be used on any content which is not 100% under the control of the machine's admin. It is really meant only for temporary caching, for serializing Python objects for multiprocessing, and similar very internal purposes, never for storing live content, much less for exchanging it.

That's the reason I suggested such a harsh solution above.

nicolaschaillan commented 2 months ago

I don't think that's good enough. As long as there is any way to make NLTK download a pickle from the Internet (for example by using internal undocumented functions; there are no truly private, inaccessible functions in Python), somebody can make a script download malicious content from the Internet.

To be clear, I was proposing a whitelist feature to block the pickle-loading function from loading any pickle that isn't whitelisted; that would also solve the Internet problem you're mentioning.
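
A sketch of that idea as a thin wrapper (the names are mine; nothing like this exists in NLTK yet):

import nltk

ALLOWED_PACKAGES = {"words", "stopwords"}  # tenant-controlled allowlist

def download_allowed(package_id):
    # Refuse anything the tenant has not explicitly approved.
    if package_id not in ALLOWED_PACKAGES:
        raise PermissionError(f"{package_id} is not on the allowlist")
    nltk.download(package_id)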

adivekar-utexas commented 2 months ago

I think a whitelist/allowlist is not that useful, since it's a bit cumbersome to maintain. It becomes a file in the user's home directory, which hurts code portability (you need the file everywhere the code is run).

In 2022, HuggingFace faced a similar remote code execution vulnerability. Their solution had two parts:

  1. Add a flag trust_remote_code=True at the model/tokenizer level, e.g. AutoModelForCausalLM.from_pretrained('gpt2-xl', ..., trust_remote_code=True)
  2. Add a TRUST_REMOTE_CODE environment variable (defaulting to False), which allowed remote code execution for all modules. Here is an example

I feel like (1) is a better long-term fix, but it seems like it could take a while to implement, given the number of modules.

@ekaf @stevenbird does (2) seem like a good idea and doable?

Concrete proposal: A single env variable TRUST_REMOTE_NLTK_CODE, which the user must manually enable in their code via os.environ["TRUST_REMOTE_NLTK_CODE"] = "True", prior to calling nltk modules.

I feel like this is a good solution because it gives users control on when to execute remote code, without changing more than 1 line of code.
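
A sketch of how such a gate could look inside a loader (the variable name is the one proposed above; NLTK has no such switch today):

import os
import pickle

def gated_pickle_load(stream):
    # Refuse to unpickle unless the user has explicitly opted in.
    if os.environ.get("TRUST_REMOTE_NLTK_CODE", "False") != "True":
        raise RuntimeError(
            "Refusing to unpickle downloaded data; set "
            "TRUST_REMOTE_NLTK_CODE=True to opt in."
        )
    return pickle.load(stream)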

Let me know what you think.

ekaf commented 2 months ago

@alvations, it is great to hear that you believe in the "deep clean" solution, where the pickles are completely removed forever.

There is also excellent news from gpt-4o, suggesting to serialize the more complex data types using Python's joblib module: chatgpt-4o-remedy.txt That is much easier than previously believed, and I have verified that the serialization works on the three complex pickle classes used with puntkt, maxent_ne_chunker, and maxent_treebank_pos_tagger. We need to verify that adjusted loaders can work as expected with the joblib dumps, and if everything goes well, it seems an ETA is within close reach now.

Dunedan commented 2 months ago

There is also excellent news from gpt-4o, suggesting to serialize the more complex data types using Python's joblib module

joblib uses pickle under the hood as well, so this wouldn't solve the issue.

ekaf commented 2 months ago

joblib uses pickle under the hood as well, so this wouldn't solve the issue.

Thanks @Dunedan! Instead, there seems to be a possibility of translating the pickles into Protobuf messages, but that may not be the most convenient, so I guess a combination of JSON and .py code would be better.

nicolaschaillan commented 2 months ago

https://github.com/advisories/GHSA-cgvx-9447-vcch

CSi-CJ commented 2 months ago

https://nvd.nist.gov/vuln/detail/CVE-2024-39705. Hi, is there an ETA? NVD doesn't care how we use it, and doesn't care how we avoid risks. They only care that there may be risks if we use it.

jdwhitaker commented 2 months ago

@Dunedan I see the CVSS base metrics say "Privileges required: None". I believe that's inaccurate and has elevated the CVSS score beyond what it should be.

You mentioned 2 ways this could be exploited, and they both seem to require privileges.

sandimciin commented 2 months ago

Having a built-in downloader makes it harder than necessary, IMHO, to depend on nltk. I understand that some dependencies are several gigabytes in size, but even that should be manageable with a package manager like poetry/nix reusing a local download.

My vote would be to optionally allow depending on individual nltk data as packages, so that I can e.g. depend on nltk-punkt-english and know that a certain version of this data has been "locked" in my poetry lockfile.

https://github.com/nltk/nltk/issues/2228#issuecomment-1648551210

Dunedan commented 2 months ago

@Dunedan I see the CVSS base metrics say "Privileges required: None". I believe that's inaccurate and has elevated the CVSS score beyond what it should be.

I wasn't involved in calculating the severity. I believe that was done by GitHub's Security Curators team. If you believe that metric isn't accurate, there is a link on the page to suggest improvements to the vulnerability details.

That said, from my perspective "Privileges required: None" seems to be correct, as it refers to privileges on the targeted system prior to the attack. See https://www.first.org/cvss/v4.0/specification-document#Privileges-Required-PR for details.

ekaf commented 2 months ago

@Dunedan, to tamper with a user's files, the attacker needs at least the same privileges as that user. Then the definition of "Privileges Required: Low" seems to apply:

The attacker requires privileges that provide basic capabilities that are typically limited to settings and resources owned by a single low-privileged user.

According to that definition, @jdwhitaker's interpretation seems correct.

digi604 commented 2 months ago

According to our SLA, we would like to have a fix for this in the next 7 days. Otherwise, we are forced to rip out NLTK

Dunedan commented 2 months ago

@Dunedan, to tamper with a user's files, the attacker needs at least the same privileges as that user.

The attacker gains these permissions as part of the attack. The "Privileges required" metric is about the privileges the attacker needs prior to the attack. For this vulnerability, an attacker doesn't need any privileges on a targeted system prior to an attack.

cbornet commented 2 months ago

According to our SLA, we would like to have a fix for this in the next 7 days. Otherwise, we are forced to rip out NLTK

How much are you willing to pay to have a fix in less than 7 days?

lucasgadams commented 2 months ago

Am I right in my understanding that if you are not explicitly running nltk.download, then you are not at risk of this vulnerability? When we included nltk in our codebase a year ago, we did a one-time download of the tokenizers we needed and then stored them in a location we control. We point NLTK to the data directory using the NLTK_DATA environment variable. If we assume that there has not been any exploitation of this vulnerability yet and the pickles from a year ago are safe, then should I be able to accept this risk? Is there a way to validate whether the existing pickles are safe and do not contain malicious code, and then checksum against known secure versions? I'm not talking about an actual fix, just wondering whether I can suppress my vulnerability alert while we wait for a better fix.

ayushxx7 commented 2 months ago

following thread...

ekaf commented 2 months ago

There are now PRs available, with fixed data packages and nltk handlers for all the suspect packages in nltk_data. So an ETA for the fix should be near. Meanwhile, it would be useful if more users reviewed the PRs and discussed alternatives. In particular, it would be helpful to confirm that nltk.data loads JSON and YAML data in a safe way.

noren95 commented 2 months ago

@ekaf Thank you for the update! Is this the fixing PR? https://github.com/nltk/nltk/pull/3286/ and is there an expected 3.8.2 release of nltk?

ekaf commented 2 months ago

Out of the 6 data packages containing pickles, 3 have been replaced by JSON, and their new handlers were already merged into the develop branch last week. The others have been replaced by tab-separated files and are awaiting review.

The last package is fixed, and #3286 now also includes a handler for it, although that package was not used from nltk before.

A release is planned, though I don't know about its version number.

chetanpadhye commented 2 months ago

As per the Aqua scan, this has an NVD CVSSv3 score of 9.8: "NLTK through 3.8.1 allows remote code execution if untrusted packages have pickled Python code, and the integrated data package download functionality is used. This affects, for example, averaged_perceptron_tagger and punkt."

There are no running workloads. When is this fix expected?

chetanpadhye commented 2 months ago

According to our SLA, we would like to have a fix for this in the next 7 days. Otherwise, we are forced to rip out NLTK

What alternative do you have?

lpi-tn commented 2 months ago

Hello! I just saw this CVE pop up in my alerts. I download punkt and bcp47; if I stop downloading them from nltk_data and use my local files instead, am I avoiding the problem?

mcepl commented 2 months ago

Out of the 6 data packages containing pickles, 3 have been replaced by JSON, and their new handlers were already merged into the develop branch last week.

Does that mean pickle loading will be removed from NLTK entirely?

cherylking commented 2 months ago

A release is planned, though I don't know about its version number.

@ekaf Any updates on when we can expect a release that resolves this CVE?

cherylking commented 2 months ago

@ekaf An entire week of silence on this issue leaves everyone affected by this CVE scrambling...not knowing whether they have to eliminate the use of this library in their code by the end of July or not. Please respond with an outlook for the release that resolves this CVE.

cbornet commented 2 months ago

@ekaf An entire week of silence on this issue leaves everyone affected by this CVE scrambling...not knowing whether they have to eliminate the use of this library in their code by the end of July or not. Please respond with an outlook for the release that resolves this CVE.

This is OSS, so the maintainers don't have to give you an ETA. I don't think they earn a dollar for the great work they do (do you want to be a sponsor?). FWIW, they could be on holiday for a month, or this bug could never be fixed. If that's not acceptable for you, you have alternatives.

I'm sure the maintainers do what they can. They know very well that their project reputation is at stake.

ekaf commented 2 months ago

Thanks @cbornet, I have also wondered how people can talk about an SLA when this is free software. However, I am sorry for not responding sooner to @cherylking. The reason is that there is no update at the moment, and your guess is as good as mine concerning the release date. But the outlook seems reasonable: if we can assume that JSON is safe, it should be possible to close this issue before the end of July (without any kind of warranty, though).