secureIT-project / CVEfixes

CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software

Assertion error when attempting to re-collect data #11

Open TomBolton opened 1 year ago

TomBolton commented 1 year ago

I was trying to re-collect the CVE data locally with a sample limit of zero, such that 2022 and 2023 records were included in the resulting database.

However, I get the following assertion error:

03/21/2023 14:19:30 git.cmd DEBUG Popen(['git', 'version'], cwd=/Users/tomb/CVEfixes, universal_newlines=False, shell=None, istream=None)
03/21/2023 14:19:30 git.cmd DEBUG Popen(['git', 'version'], cwd=/Users/tomb/CVEfixes, universal_newlines=False, shell=None, istream=None)
[]
03/21/2023 14:19:30 CVEfixes INFO ----------------------------------------------------------------------
03/21/2023 14:19:31 CVEfixes INFO The CVE json for 2002 has been merged
03/21/2023 14:19:32 CVEfixes INFO The CVE json for 2003 has been merged
03/21/2023 14:19:34 CVEfixes INFO The CVE json for 2004 has been merged
03/21/2023 14:19:35 CVEfixes INFO The CVE json for 2005 has been merged
03/21/2023 14:19:37 CVEfixes INFO The CVE json for 2006 has been merged
03/21/2023 14:19:39 CVEfixes INFO The CVE json for 2007 has been merged
03/21/2023 14:19:41 CVEfixes INFO The CVE json for 2008 has been merged
03/21/2023 14:19:43 CVEfixes INFO The CVE json for 2009 has been merged
03/21/2023 14:19:45 CVEfixes INFO The CVE json for 2010 has been merged
03/21/2023 14:19:53 CVEfixes INFO The CVE json for 2011 has been merged
03/21/2023 14:19:54 CVEfixes INFO The CVE json for 2012 has been merged
03/21/2023 14:19:57 CVEfixes INFO The CVE json for 2013 has been merged
03/21/2023 14:20:00 CVEfixes INFO The CVE json for 2014 has been merged
03/21/2023 14:20:02 CVEfixes INFO The CVE json for 2015 has been merged
03/21/2023 14:20:04 CVEfixes INFO The CVE json for 2016 has been merged
03/21/2023 14:20:08 CVEfixes INFO The CVE json for 2017 has been merged
03/21/2023 14:20:13 CVEfixes INFO The CVE json for 2018 has been merged
03/21/2023 14:20:16 CVEfixes INFO The CVE json for 2019 has been merged
03/21/2023 14:20:23 CVEfixes INFO The CVE json for 2020 has been merged
03/21/2023 14:20:30 CVEfixes INFO The CVE json for 2021 has been merged
03/21/2023 14:20:34 CVEfixes INFO The CVE json for 2022 has been merged
03/21/2023 14:20:35 CVEfixes INFO The CVE json for 2023 has been merged
03/21/2023 14:20:35 CVEfixes INFO Flattening CVE items and removing the duplicates...
03/21/2023 14:22:35 CVEfixes INFO All CVEs have been merged into the cve table
03/21/2023 14:22:35 CVEfixes INFO ----------------------------------------------------------------------
03/21/2023 14:22:37 CVEfixes INFO Extracting CWE data from cwec_v4.10.xml
03/21/2023 14:22:39 CVEfixes INFO Adding CWE category to CVE records...
03/21/2023 14:25:12 CVEfixes DEBUG List of CWEs from CVEs that are not associated to cwe table are as follows:
03/21/2023 14:25:12 CVEfixes DEBUG {'CWE-1026'}
Traceback (most recent call last):
  File "Code/collect_projects.py", line 245, in <module>
    cve_importer.import_cves()
  File "/Users/tomb/CVEfixes/Code/cve_importer.py", line 175, in import_cves
    assign_cwes_to_cves(df_cve=df_cve)
  File "/Users/tomb/CVEfixes/Code/cve_importer.py", line 128, in assign_cwes_to_cves
    assert set(list(df_cwes_class.cwe_id)).issubset(set(list(df_cwes.cwe_id))), \
AssertionError: Not all foreign keys for the cwe_classification records are present in the cwe table!

Note that if I re-run the code with a non-zero sample limit of 100, it finishes fine. Any ideas what might be causing this? Is it an external issue related to the data of CWE-1026?
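For readers who want to see which IDs trip the check, here is a minimal sketch of the same subset test using plain Python sets. The real code compares the `cwe_id` columns of `df_cwes_class` and `df_cwes`; the function name `missing_foreign_keys` and the sample values below are hypothetical.

```python
def missing_foreign_keys(classified_cwe_ids, known_cwe_ids):
    """Return CWE IDs used in classifications but absent from the CWE table."""
    return sorted(set(classified_cwe_ids) - set(known_cwe_ids))


# Hypothetical sample values, not taken from the real dataset.
classified = ["CWE-79", "CWE-1026", "CWE-89", "CWE-79"]
known = ["CWE-79", "CWE-89", "CWE-1027"]

missing = missing_foreign_keys(classified, known)
print(missing)  # ['CWE-1026']
```

An empty result corresponds to the case where the assertion would pass.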

yper3 commented 1 year ago

I have the same problem!!!

leonmoonen commented 1 year ago

Hi, Thanks for your interest in CVEfixes 🙏

I had a quick look, and this assertion error arises because the latest CWE XML file distributed by MITRE (https://cwe.mitre.org/data/xml/cwec_latest.xml.zip, which is v4.10) does not define CWE-1026 as a weakness (search for CWE_ID="1026"); it is now defined as a view that groups several CWE types. On the other hand, in the NVD distributed by NIST, CVE-2022-4147 is classified as CWE-1026 and not as one of its "subtypes"...
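One way to confirm this locally is to list which IDs appear as weaknesses and which as views. The sketch below is only an illustration: it assumes the catalog uses `Weakness`, `View` and `Category` elements with an `ID` attribute, ignores the XML namespace, and runs on a tiny hand-written stand-in (with a made-up namespace and names) rather than the real cwec_v4.10.xml.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def classify_cwe_ids(xml_text):
    """Group the ID attributes found in the catalog by element kind."""
    root = ET.fromstring(xml_text)
    ids = {"Weakness": set(), "View": set(), "Category": set()}
    for elem in root.iter():
        kind = local_name(elem.tag)
        if kind in ids and "ID" in elem.attrib:
            ids[kind].add(elem.attrib["ID"])
    return ids


# Simplified stand-in for the CWE catalog; namespace and names are hypothetical.
SAMPLE = """<Weakness_Catalog xmlns="http://example.org/cwe">
  <Weaknesses>
    <Weakness ID="79" Name="Sample weakness"/>
  </Weaknesses>
  <Views>
    <View ID="1026" Name="Sample view">
      <Members><Has_Member CWE_ID="1027" View_ID="1026"/></Members>
    </View>
  </Views>
</Weakness_Catalog>"""

ids = classify_cwe_ids(SAMPLE)
print("1026" in ids["Weakness"], "1026" in ids["View"])  # False True
```

To run it against the real file, read the XML text from disk and pass it to `classify_cwe_ids`.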

One "hack" to get things running is to manually update your CVE-2022-4147 to use CWE-1027, the first subtype in the view, or to add 1026 as a member of itself in cwec_v4.10.xml under View ID="1026". The collection scripts will not overwrite files that are already there, so this allows you to continue/restart as if you had the correct info from NVD/CWE.
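The first variant of the hack could be sketched roughly as follows. This is an assumption-heavy illustration: it assumes the downloaded NVD JSON stores each CVE's CWE labels under `cve.problemtype.problemtype_data[].description[].value` (please verify against your local files), and `remap_cwe` is a hypothetical helper, not part of CVEfixes.

```python
def remap_cwe(cve_item, old="CWE-1026", new="CWE-1027"):
    """Replace one CWE label with another inside a single CVE item (in place).

    Returns the number of labels that were changed.
    """
    changed = 0
    problemtype = cve_item.get("cve", {}).get("problemtype", {})
    for entry in problemtype.get("problemtype_data", []):
        for desc in entry.get("description", []):
            if desc.get("value") == old:
                desc["value"] = new
                changed += 1
    return changed


# Hand-written item in the assumed layout, not copied from the NVD.
item = {
    "cve": {
        "CVE_data_meta": {"ID": "CVE-2022-4147"},
        "problemtype": {
            "problemtype_data": [
                {"description": [{"lang": "en", "value": "CWE-1026"}]}
            ]
        },
    }
}

changed = remap_cwe(item)
print(changed)  # 1
```

Applying this to a real feed would mean loading the JSON file, calling the helper on each item, and writing the file back before restarting the collection.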

We may also be able to code around this in the collection, but I'm a bit reluctant to do that now, as a lot of things are in flux around the NVD data distribution while they are moving from distributing JSON feeds to only providing API access (and this is after dropping XML for JSON two or three years ago). At some point, we will need to do a rather deep rewrite of CVEfixes to use this new API, but we, unfortunately, do not have the resources available for doing this in the short term.

cheers, Leon

TomBolton commented 1 year ago

Hi Leon, thanks for the fast response!

One "hack" to get things running is to manually update your CVE-2022-4147 to use CWE-1027, the first subtype in the view, or to add 1026 as a member of itself in cwec_v4.10.xml under View ID="1026". The collection scripts will not overwrite files that are already there, so this allows you to continue/restart as if you had the correct info from NVD/CWE.

I will try this out now 🙏

TomBolton commented 1 year ago

So far so good 👍

TomBolton commented 1 year ago

The data collection took 2 days and did get to the end thankfully.

However, right at the end, I get an error stating the cwe_classification table doesn't exist in the database...

[Image: Screenshot 2023-03-24 at 08 59 57]

...which is true:

[Image: Screenshot 2023-03-24 at 09 54 28]

The CVEfixes.db is ~27GB, so it does seem like it's just the cwe_classification data missing. Any ideas on what to do here?

leonmoonen commented 1 year ago

My apologies, this one slipped through the cracks. The only guess I have is that the tables were not created as a result of the initial error, and after "hacking" the CWE data, the code path was never exercised again because there was already a database. I'll need to dive into the code to make this more robust, but this may take some time due to other obligations. For now, my only suggestion would be to do a fresh run (i.e. remove Data/CVEfixes.db) while you have the "hacked" CWE data files already in place. It should also be possible to just add the CWE table to the existing database, which would save on the overall collection time, but that would need some dedicated code.

Note that there is also a chance that the initial hack is no longer needed and the CWE and CVE records have been synced in the NVD...
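Before deciding between a fresh run and patching the database, it may help to check which tables actually exist. A minimal sketch using Python's built-in `sqlite3` module follows; it runs against an in-memory database with made-up columns, and the helper names are hypothetical. For the real file, connect to Data/CVEfixes.db instead.

```python
import sqlite3


def table_exists(conn, name):
    """Return True if a table with this name is listed in sqlite_master."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def missing_tables(conn, expected):
    """Return the expected table names that are not present in the database."""
    return [t for t in expected if not table_exists(conn, t)]


# In-memory stand-in for the real database; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cve (cve_id TEXT)")
conn.execute("CREATE TABLE cwe (cwe_id TEXT)")

absent = missing_tables(conn, ["cve", "cwe", "cwe_classification"])
print(absent)  # ['cwe_classification']
```

If only `cwe_classification` is reported missing, that matches the situation described above, where the rest of the collected data is intact.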

sarthak247 commented 12 months ago

Hey. I was also working with CVEfixes and encountered the same error. Any updates on this?

leonmoonen commented 12 months ago

We have not been able to find the source of this error as it does not always happen (which makes me think it may be caused by a race condition). We are working on a new version of CVEfixes that uses NIST's APIs instead of the JSON files (which will be phased out). This update is turning into such a substantial overhaul of the project that I expect that race conditions in the old code will not survive (famous last words ;-)).

leonmoonen commented 3 months ago

I've been out of the running due to a traffic accident, but I wanted to share that the new version of CVEfixes is planned to be finished around August/September 2024. A newly collected dataset will be released on Zenodo in the next few days.