responsible-ai-collaborative / aiid

The AI Incident Database seeks to identify, define, and catalog artificial intelligence incidents.
https://incidentdatabase.ai

Download the snapshots backup #2954

Closed rcao1997 closed 1 month ago

rcao1997 commented 1 month ago

Hello,

First, I would like to express my appreciation for this project; thank you for establishing such a wonderful database.

I am reading through the README file and trying to download the latest database backup from https://incidentdatabase.ai/research/snapshots. I tried clicking on the .tar.bz2 links, but none of them work: they all direct me to a 'Your connection is not private' page (I tried both Chrome and Safari). Is this the correct way to download the snapshots, and how can I resolve this issue? Thank you so much in advance.

smcgregor commented 1 month ago

Hi @rcao1997, great to hear you are finding the database useful. It sounds like you may have a VPN or firewall that is blocking your ability to download the data. The address for the backup data is ugly looking (see below). Maybe you have some security software that doesn't want you to download a file from such an address? Please let us know!

https://pub-a5fe3e44369c4cabb576fa0d2c09fdf6.r2.dev/backup-20240708100539.tar.bz2

rcao1997 commented 1 month ago

Hi @smcgregor, thank you for the reply! It turns out that the Wi-Fi I am using right now has an additional firewall layer. I successfully downloaded the data over another network. Much appreciated!

I also have two other questions about the data. According to the README file, I need a MongoDB account to access the snapshot files. But I noticed there is an incident.csv file in the data I downloaded, and it seems I can use the CSV file directly. Is there any difference between using MongoDB and using the CSV file as it is?

I also noticed on the website that each incident is described by a news article. I randomly opened a few CSV files, and it seems the 'text' column holds the news article, but many of the CSV files do not have that column. Is there a way to access the news article text from the dataset?

Really appreciate your help.

smcgregor commented 1 month ago

Hi @rcao1997, it is good to hear you got access to the download.

The MongoDB snapshot is the most definitive as it includes almost everything in the database. The CSV files have much of the information, but I highly recommend interfacing with the data in MongoDB since the dataset is getting to be quite large and the CSVs will never have full coverage. Have you tried this before? Please let us know if that proves too difficult.

@kepae do you have another suggestion here?

kepae commented 1 month ago

Hi @rcao1997, I second @smcgregor's suggestion. Though specifically I would restore the database snapshot into a local running version of MongoDB on your machine, and then query the data using something like MongoDB Compass. This is how the dev team often works with the data! (That is -- there is no need to create a MongoDB Atlas (cloud-hosted) account and work with the data there. Rather, I suggest staying local.)

You can see instructions for installing MongoDB on your platform here: https://www.mongodb.com/docs/manual/installation/

Here are some suggestions on how to restore the snapshot into your DB instances: https://stackoverflow.com/questions/18931668/how-to-restore-the-dump-into-your-running-mongodb
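For example, the restore boils down to two commands (a sketch; the extracted directory layout and database names are assumptions, so check them after extraction):

```shell
# Extract the snapshot archive (filename taken from the link above).
tar -xjf backup-20240708100539.tar.bz2

# The archive contains mongodump-format output; point mongorestore at
# the extracted dump directory to recreate the collections locally.
mongorestore --host localhost:27017 <extracted-dump-directory>
```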

And indeed, the .csv files are not always the most comprehensive, as they have to flatten a lot of nested data and make compromises.

Finally, an alternative way to start exploring the data is by using our unsupported GraphQL API. This API is exposed for the website to query on live requests and less for user-friendly analysis of the data, but it often helps to start here. An example query: https://cloud.hasura.io/public/graphiql?endpoint=https%3A%2F%2Fincidentdatabase.ai%2Fapi%2Fgraphql&query=query+MyQuery+%7B%0A++incidents%28query%3A+%7Bdate_gte%3A+%222024-01-01%22%7D%29+%7B%0A++++date%0A++++description%0A++++title%0A++%7D%0A%7D%0A
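For anyone who prefers scripting over the GraphiQL explorer, here is a minimal Python sketch of the same query against that endpoint. The query mirrors the example linked above; everything else (function names, error handling) is illustrative only, and the endpoint remains unsupported:

```python
# Hedged sketch: query AIID's unsupported GraphQL endpoint with stdlib HTTP.
import json
import urllib.request

ENDPOINT = "https://incidentdatabase.ai/api/graphql"

def build_incidents_query(date_gte):
    """Build the JSON payload for incidents on or after date_gte."""
    query = (
        f'query {{ incidents(query: {{date_gte: "{date_gte}"}}) '
        "{ date description title } }"
    )
    return {"query": query}

def fetch_incidents(date_gte):
    """POST the query and return the decoded JSON response."""
    payload = json.dumps(build_incidents_query(date_gte)).encode()
    req = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```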

(We are working on developing a fully supported API. :-) )

rcao1997 commented 1 month ago

Hi @smcgregor and @kepae, thank you so much for your suggestions. I really appreciate it. I followed the steps and was able to restore the snapshot to my local MongoDB environment. However, because I need to use Python to integrate the data, I decided to import the bson files directly in Python. I have a few questions; it would be great if you could take the time to answer them:

My first question is whether my use of Python to open the bson files is valid. I plan to take the incidents.bson file from each snapshot, combine them into one Pandas dataframe, and check for duplicates. Should I also consider any other files if I want to integrate the whole database?
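For reference, this is roughly what I am doing (a sketch, not final code; it assumes pandas plus the `bson` module that ships with pymongo, and assumes `incident_id` is the stable key):

```python
# Sketch: combine incidents.bson files from several snapshots and
# drop duplicate incidents, keeping the row from the newest snapshot.
import pandas as pd

def load_incidents(path):
    """Decode every document in one incidents.bson dump file."""
    import bson  # provided by the pymongo package
    with open(path, "rb") as f:
        return list(bson.decode_file_iter(f))

def combine_snapshots(snapshots):
    """snapshots: lists of incident documents, ordered oldest to newest."""
    frames = [pd.DataFrame(docs) for docs in snapshots]
    combined = pd.concat(frames, ignore_index=True)
    # keep="last" keeps the most recent snapshot's version of each incident.
    return combined.drop_duplicates(subset="incident_id", keep="last")
```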

Also, I found the data is incomplete compared to what I can access on your website, both through the MongoDB portal and when reading it in Python. For example, for this case: https://incidentdatabase.ai/cite/1. 1) None of the taxonomy classifications are shown in the dataset. 2) The 'Full Description' under the CSETv0 Taxonomy Classifications is not in the data; instead, the local data shows different content for the description. 3) The report timeline is not in the data. My guess is that this incident was updated with all of this information later in the database? If not, how can I generate this information?

Finally, is it possible to get the news article for each incident?

Again, I really appreciate your time and support. I am very excited about this database and hope to be able to use it soon.

smcgregor commented 1 month ago

> My first question is whether my usage of Python to open the bson file is valid? I plan to use the incidents.bson file from each snapshot, integrate them to one Pandas dataframe and check duplicates. Should I also consider any other files if I want to integrate the whole database?

You can find schema files detailing what is in each collection here. You will likely need to join incidents.bson with another collection to get everything you need, including the report contents you are asking for in the next question.
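As an illustration, such a join can be sketched in pandas (a hedged sketch: the field names `incident_id`, `reports`, `report_number`, and `text` are taken from the schema files, but verify them against your snapshot):

```python
# Sketch: attach report article text to incidents. In the AIID schema,
# each incident document carries a `reports` list of report numbers,
# and each report document carries `report_number` and `text`.
import pandas as pd

def attach_report_text(incidents, reports):
    """incidents: DataFrame with a `reports` list column.
    reports: DataFrame with `report_number` and `text` columns."""
    # explode() yields one row per (incident, report_number) pair.
    pairs = incidents.explode("reports").rename(
        columns={"reports": "report_number"}
    )
    # Left join so incidents with missing reports are kept (text is NaN).
    return pairs.merge(
        reports[["report_number", "text"]], on="report_number", how="left"
    )
```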

On the topic of the CSETv0 taxonomy, I think there may be a difference between the name of the field as displayed in the user interface and the name you will find in the archive. Can you check on this and if you don't find it we will take a deeper look into the archive?

rcao1997 commented 1 month ago

> My first question is whether my usage of Python to open the bson file is valid? I plan to use the incidents.bson file from each snapshot, integrate them to one Pandas dataframe and check duplicates. Should I also consider any other files if I want to integrate the whole database?

> You can find schema files detailing what is in each collection here. You will likely need to join incidents.bson with another collection to get everything you need, including the report contents you are asking for in the next question.

> On the topic of the CSETv0 taxonomy, I think there may be a difference between the name of the field as displayed in the user interface and the name you will find in the archive. Can you check on this and if you don't find it we will take a deeper look into the archive?

Thanks for your suggestion. I looked over the collections, and it seems everything I need is stored in the attributes of classifications.bson. The number of rows in this file is smaller than in incidents.bson; my understanding is that the information for some incidents is not detailed enough to be put in the classifications file.

I will close this issue for now. Thank you again for answering all my questions. Please let me know if there is any additional information I should be aware of.

kepae commented 1 month ago

Glad to hear you can find the data you need, @rcao1997!

If you are interested, please feel free to join the AIID research and development community on Slack! We can answer more questions there, too.

In the future we plan to offer a more convenient, online API for accessing incident data and classifications without requiring you to manually join the data. cc @cesarvarela