Hi Aditya,
I've done some initial replication experiments, and it seems I cannot reproduce this issue when using python3.8 and the dependencies from requirements.frozen.txt (either on an Intel Mac or on Ubuntu Server 20.04 LTS). Can you add some information about your runtime environment?
I can confirm that python3.10 runs into an issue when collecting the pip requirements, both on an Intel Mac and on Ubuntu Server 22.04 LTS (where 3.10 is the default). Moreover, I haven't been able to make this work on ARM CPUs, because tensorflow 2.5.0, which guesslang requires, is not available there.
Finally, I found an issue caused by MITRE having changed the (version number in the) CWE filename that we explicitly expected. Later tonight I will release v1.0.3, which addresses this issue and adds better logging that may help with debugging.
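In the meantime, a trivial guard in front of the install step can confirm that you are on 3.8, which is so far the only version I have confirmed to work with the frozen requirements (just a sketch):

```python
import sys

# requirements.frozen.txt is known to work on 3.8; 3.10 currently breaks pip's requirement collection
if sys.version_info[:2] != (3, 8):
    raise SystemExit(f"Python 3.8 expected, found {sys.version.split()[0]}")
```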
cheers, Leon
Hi Leon,
Thank you very much for the new version. I pulled the latest code and was able to execute the tool successfully with sample_limit = 500. I will continue to generate the results.
Thank you, Aditya
Thank you! It worked for me now.
Just a heads up, there is still an (unrelated) bug when URLs have become unavailable after they were included in CVEs (e.g., because the repositories have become private). There are 51 of those in the total CVE data up to Aug 2022. We have code to remove those URLs, but it seems to have stopped working correctly. As a result, the git clone during collection of an unavailable repository will ask for a login, effectively stopping progress... I'll try to push a fix for this later today.
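In the meantime, the hang can be avoided by telling git not to prompt, so that clones of unavailable repositories fail fast instead. A minimal sketch of such a pre-check (the URL is just a placeholder, and this is not the code currently in CVEfixes):

```python
import os
import subprocess

def repo_is_reachable(url, timeout=30):
    """Return True if `git ls-remote` succeeds without prompting for credentials."""
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0")  # make git fail fast instead of asking for a login
    try:
        result = subprocess.run(["git", "ls-remote", url, "HEAD"],
                                env=env, capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# a repository that has gone private/unavailable returns False instead of hanging on a login prompt
print(repo_is_reachable("https://github.com/example/possibly-unavailable-repo"))
```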
Ok, thank you!
Hello Leon,
Just one more query regarding collecting from scratch: if I try to re-run the create_CVEfixes_from_scratch.sh script, does it crawl incrementally or does it start all over again from scratch?
The complete answer is that there are two levels of incrementality to consider here:

1. What I think of as "ultimate incrementality": if you start with a database collected in June 2021, can you run the collection in Aug 2022 to bring the complete database up to date with all CVEs published up to Aug 2022, while only downloading new or changed records?
   - Although I recognize this would be great to have, this level of incrementality is unfortunately not supported at the moment. We made this choice because "old" CVE records can be updated later on (e.g., adding additionally affected systems or linking to new patches). We felt that the logic to check, update, and interlink existing collected records for such changes would be too failure-prone to warrant the potential savings in collection time.
2. What I'll call "recovery incrementality": if you start a collection process today and it breaks for some reason, can you restart it, and will the collection continue from where it left off?
   - We do support this level of incrementality, and the process will continue to collect the same set of CVEs that you were aiming to collect in the initial run.
Thank you! I was looking for "ultimate incrementality", but "recovery incrementality" also helps.
Hi Leon,
My understanding from the README is that a full collection from scratch can take multiple days (depending on connection speed). Good to know the process can be recovered and resumed if something fails along the way. Thanks for confirming that!
Regarding "ultimate incrementality", I see your point that it's hard to get that right if already-collected data may have changed. Something I'd like to point out is that this limits the usability of CVEfixes for staying up-to-date with recent CVEs (which a lot of folks will want to do).
Since collecting is a multi-day process and needs to be re-done from scratch each time, this limits the frequency with which one might be willing to fetch updates, and effectively imposes a multi-day delay.
Are there any plans to make staying up-to-date more feasible? Maybe an incremental mode with weaker guarantees, i.e., one that doesn't guarantee you're getting the latest version of the previously fetched CVEs?
For our use case, we'd be happy to incrementally fetch the latest news daily without such guarantees, and then do a full refetch-from-scratch weekly or bi-weekly to get any changes to the older history and fix any broken links.
Is this something you might envision adding? It would greatly help us adopt CVEfixes. Thanks in advance for considering it.
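To make the weaker-guarantee idea concrete, the daily fetch we have in mind would only ask NVD for records modified since the last run, e.g. via the NVD 2.0 API. A rough sketch (this is not CVEfixes functionality; pagination, API keys/rate limits, and the exact timestamp format the API expects are glossed over):

```python
import json
import urllib.parse
import urllib.request

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def recently_modified_cves(start, end):
    """Return ids of CVE records modified in [start, end] (single page, no pagination handling)."""
    params = urllib.parse.urlencode({"lastModStartDate": start, "lastModEndDate": end})
    with urllib.request.urlopen(f"{NVD_API}?{params}") as response:
        data = json.load(response)
    return [item["cve"]["id"] for item in data.get("vulnerabilities", [])]

# e.g., everything NVD touched during the first week of August 2022
print(recently_modified_cves("2022-08-01T00:00:00.000", "2022-08-08T00:00:00.000"))
```

The ids coming out of such a query could then feed a partial collection, with the full refetch-from-scratch catching any changes to older records that we missed in between.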
I would definitely be open to having an incremental mode in the collection process, and a version with weaker guarantees certainly seems feasible. I don't know how much time I will have to work on this myself, but I would certainly welcome all contributions in this direction.
That said, we have been working on a major update that adds some extra fields that could also be of interest for you in a CodeGuru context (such as the license under which a project was released). This update needs a bit more polishing before we can release it, but I'll prioritize that to make it easier for people to contribute.
Given that the collection is largely an embarrassingly parallel problem, there are also a number of optimizations that could be achieved by simply farming the process out over multiple workers. We already do a bit of this, but it could absolutely be done in more places. The only thing to watch out for is not hammering GitHub's servers, since they host most of the code (we already need to control for this when checking whether referenced repositories are still available).
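For what it's worth, the pattern is essentially a bounded worker pool with a throttle on the GitHub-facing calls. A generic sketch (not the current CVEfixes code; collect_one is a placeholder for the per-CVE work, and the limits are purely illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

GITHUB_SLOTS = Semaphore(4)   # cap the number of concurrent GitHub-facing requests
PAUSE = 0.5                   # seconds between calls per slot, to stay well inside rate limits

def collect_one(cve_id):
    """Placeholder for the per-CVE work: resolve references, clone/inspect the fix commit, etc."""
    with GITHUB_SLOTS:
        time.sleep(PAUSE)
        # ... GitHub API calls / git clone for this CVE would go here ...
    return cve_id

cve_ids = ["CVE-2022-0001", "CVE-2022-0002"]      # stand-in for the real worklist
with ThreadPoolExecutor(max_workers=8) as pool:   # the num_workers-style fan-out
    for done in pool.map(collect_one, cve_ids):
        print("collected:", done)
```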
I will add incremental mode as a new issue/feature request.
Hello Leon,
I tried to execute create_CVEfixes_from_scratch.sh with the latest version, v1.0.5. The script ran for a couple of days; however, I am getting the error below.
CVEfixes WARNING Could not retrieve commit information from: https://github.com/bottlepy/bottle
Code/create_CVEfixes_from_scratch.sh: line 8: <pid> Killed python3 Code/collect_projects.py
Could you please help with pointers to resolve the issue?
Thank you, Aditya
Hi Aditya,
It is a bit hard to see what went wrong just based on that message. FYI, (after a few restarts) I have recently successfully collected the dataset on a VM with 8 cores, 64 GB of memory and 75 GB of disk space. This took around a day (with num_workers = 8 in the .ini file). I noticed the resulting database is around 20 GB (much larger than the ~4 GB we had a year ago), and we will do some additional testing to check that there was no regression due to changed behavior in libraries we depend on. If all is good, I'll share the data via Zenodo for reuse. I can already see that the bottlepy repository that was an issue in your case was collected correctly in my run.
Some pointers/questions to debug your issue:
- Files are placed under /tmp during extraction (and removed afterwards), but if this folder is on a partition that runs out of disk space, the observed behavior could occur.

cheers, Leon
Hi Leon,
Thank you for your reply.
Yes, it looks like I have sufficient memory, and the size of the DB is 19 GB. Also, I don't have credentials created for GitLab, so those repositories will be ignored. Do you think this counts as a successful execution?
I tried re-running the script but I am getting the same message.
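In case it helps others hitting the same `Killed` message: if the kernel's OOM killer terminated the process, `dmesg` usually shows it, and a quick pre-flight check of memory and free disk space (including /tmp, per Leon's pointer above) can rule out resource exhaustion. A minimal sketch (the paths are assumptions):

```python
import os
import shutil

# Rough pre-flight check before (re)starting a multi-day collection run (the sysconf calls are Linux-only).
mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
print(f"physical memory: {mem_bytes / 2**30:.1f} GiB")

for path in ("/tmp", "."):   # extraction space and the directory where the database is written
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")
```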
Thank you, Aditya
Hi,
I have never seen killed processes in a successful collection run, so I don't think it was a fully successful execution.
Could you give it one more try: please set the logging level to DEBUG in your .CVEfixes.ini, re-run the script while capturing the log[^1], and share the log here or by email? Hopefully that gives us a bit more insight into what is going on...
[^1]: (This may be overkill, but) I do this in bash while also timing the execution using: `{ time sh Code/create_CVEfixes_from_scratch.sh ; } 2>&1 | tee cvefixes.log`
thanks, Leon
Hi Leon,
I tried to rerun from scratch and am getting the same error, with the system at 94% CPU. I have an 8-core machine and had set num_workers = 6 in the .ini file.
Thank you, Aditya
Leon, thank you for your help debugging this issue.
Aditya, could you please make one more attempt and confirm that you have addressed all of Leon's recommendations? I.e.:
- num_workers = 8 in your .ini file

If you still get an abnormal termination or killed processes, please email Leon the full log plus the full .ini file.
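To rule out the settings simply not being picked up, something like the following could confirm what the tool will actually read from the config (a sketch; the location of the .ini file is an assumption):

```python
import configparser
from pathlib import Path

# Print every setting the collector would see; adjust the path to wherever your .CVEfixes.ini lives.
cfg = configparser.ConfigParser()
cfg.read(Path.home() / ".CVEfixes.ini")
for section in cfg.sections():
    for key, value in cfg[section].items():
        print(f"[{section}] {key} = {value}")
```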
Thank you both!
Also, @leonmoonen, would you let us know when you share your most recent DB in Zenodo?
It is important for us to confirm that we can successfully run the tool before we adopt it. But while we get there, an up-to-date DB would help us run some test experiments in parallel on recent data.
Thanks again.
Hi,
The most recent dataset (v1.0.7, collected Aug 27, 2022) is now available via https://zenodo.org/record/7029359.
@aditya-deshpadne, how are things going with re-collecting/capturing the debug log? We have now added some debugging options to collect just a specific set of CVEs, and we cannot reproduce the errors that you encountered with the bottlepy repository, so I'm afraid this was caused by a harder-to-debug environment problem (we have done many successful collections in various environments since the issue was opened). Having a log would help to dig into your particular challenges. Alternatively, we could try to provide a Docker or Singularity environment, although that requires a bit of tweaking to the main logic to "externalize" the GitHub API key (and ideally also the database that is generated), so this may take a bit of time... let me know if this would be of interest...
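For context, that externalization could be as simple as preferring an environment variable over the ini entry. A sketch (the GITHUB_TOKEN name and the section/key names are assumptions, not the current implementation):

```python
import configparser
import os

def github_token(ini_path=".CVEfixes.ini"):
    """Prefer an environment variable so the key never has to be baked into an image or config file."""
    token = os.environ.get("GITHUB_TOKEN")            # variable name chosen for this sketch
    if token:
        return token
    cfg = configparser.ConfigParser()
    cfg.read(ini_path)
    return cfg.get("GITHUB", "token", fallback=None)  # section/key names here are assumptions
```

A container would then only need the variable passed in at run time (e.g., `docker run -e GITHUB_TOKEN=...`), and the generated database could live on a mounted volume.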
cheers, Leon
Thank you @leonmoonen / @awsgaucho! I think the error was due to insufficient memory and CPU; I was able to run the script successfully with additional resources.
08/27/2022 07:52:14 CVEfixes INFO Data pruning has been completed successfully
08/27/2022 07:52:14 CVEfixes INFO ----------------------------------------------------------------------
08/27/2022 07:52:15 CVEfixes INFO The database is up-to-date.
08/27/2022 07:52:15 CVEfixes INFO ----------------------------------------------------------------------
08/27/2022 07:52:15 CVEfixes INFO Time elapsed to pull the data 50:42:15 (hh:mm:ss).
The size of the DB is around 21 GB.
Hello, I was trying this tool to create the dataset from scratch. I followed the steps mentioned in Install.md, and all dependencies are satisfied. However, when I try to run the script create_CVEfixes_from_scratch.sh, I get the error below:
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT * FROM cwe_classification': no such table: cwe_classification
Could you please help with pointers to resolve the issue? Thanks in advance.
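In case it helps with debugging, a quick way to list which tables actually ended up in the generated database (the path is an assumption about my local setup):

```python
import sqlite3

# The database path is an assumption; point this at wherever your run writes CVEfixes.db.
con = sqlite3.connect("Data/CVEfixes.db")
tables = [row[0] for row in con.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(sorted(tables))   # after a successful collection, cwe_classification should appear here
```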