pochedls / xagg

Software to create xml links to underlying CMIP netCDF data

Flag / delete retracted data #30

Closed pochedls closed 4 years ago

pochedls commented 4 years ago

The xml files we use currently link to every possible dataset we can find, including retracted data. We should flag or delete these files in some way so that the user can avoid them.

A web-based api would be incredibly slow unless we can create some kind of bulk request for retracted datasets.

@painter1 / @durack1 - do you have ideas about how we can identify and flag / delete retracted data? Are there local databases that contain this information?

durack1 commented 4 years ago

@pochedls theoretically @painter1's sdt6.db is exactly what we'd want to query here. Not sure how up-to-date that would be with ESGF API calls.

painter1 commented 4 years ago

One of my cron jobs is a script which runs early each morning and queries the index node for newly retracted data. It marks the status of affected files and datasets in the Synda database. This is backed up to sdt6.db the following night.

There is a minor problem in that I try to get only the latest datasets in order to control the load on the index node. But the index node isn't consistent in the order of results, so sometimes things are missed. From time to time I manually force it to get everything, so that we won't miss anything. I haven't gotten around to dealing with this, but I plan to: (1) discuss with Sasha whether he can change the ESGF software so that I can reliably get incremental results; (2) if that doesn't help, automate the present procedure. That is, get a complete list once a week as well as the daily incremental list.

Because I have gotten complaints about putting a load on the index node (albeit not for this reason), I think it would be better to get retracted data lists from sdt6.db than to query the index node directly. It also would be easier to program.

durack1 commented 4 years ago

@painter1 thanks for the update, this is very useful information with which to plan our use of local info

pochedls commented 4 years ago

@painter1 - Thanks for your help with this. Would retracted be reflected in the dataset table under the status column? If you could point me to an example (or the table / column / value) that I can use that would be helpful.

I've outlined a script to:

painter1 commented 4 years ago

Yes. If a file or dataset’s status contains the string “retracted”, then it is retracted. Here is a list of all the file statuses at present. Most of these are nonstandard ones which I made up for local use; they are not recognized by Synda so files with these statuses are ignored, that’s what I want:

synda queue

status                 count      size
_Nor_misnamed          33523      6.71 TB
done                   3171489    930.33 TB
done,retracted         637724     124.20 TB
error                  40146      25.39 TB
error,retracted        10         45.30 MB
error-badurl           8049       4.37 TB
error-checksum         7425       924.51 GB
error-path             334718     91.05 TB
obsolete               29303      25.59 TB
published              8258174    3.56 PB
published,retracted    1123452    303.26 TB
retracted              462069     100.89 TB
running                69         388.04 GB
waiting                493171     121.93 TB
waiting,retracted      456888     67.67 TB

There are not nearly so many dataset statuses, but the story is the same: check whether status.find('retracted') >= 0.
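A minimal Python sketch of the status check described above, assuming status values are plain strings like the ones in the list. The helper name `is_retracted` is hypothetical and is not part of Synda or xagg.

```python
def is_retracted(status):
    """Return True if a status string marks a file or dataset as retracted."""
    # Substring test, equivalent to status.find('retracted') >= 0.
    # A missing (None) status is treated as not retracted (an assumption).
    return status is not None and "retracted" in status


# A few of the status values from the list above.
statuses = ["done", "done,retracted", "error-path", "retracted", "waiting,retracted"]
flagged = [s for s in statuses if is_retracted(s)]
print(flagged)
```

The substring test catches combined statuses such as `done,retracted`, which an exact comparison against `'retracted'` would miss.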

You don’t have to worry about corrupting sdt6.db. It is not my working database, or even my main backup. It's there for you. Every night a little past 11:00pm, my working database is copied over it. If you will want to reference it around that time, let me know and I will change to do a copy-delete-rename rather than a simple copy.
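One way to avoid readers seeing a half-written database during the nightly copy is sketched below. This is not the actual backup script: it is a variant of copy-then-rename that relies on `os.replace` instead of an explicit delete step, and the function name `publish_copy` is hypothetical.

```python
import os
import shutil
import tempfile


def publish_copy(src, dst):
    """Copy src to dst so that readers never see a partially written dst."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Write the copy under a temporary name in the same directory,
    # so the final rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=dst_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        # Swap the finished copy into place in a single rename step.
        os.replace(tmp, dst)
    except BaseException:
        # Clean up the temporary file if the copy or rename failed.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

On POSIX filesystems, a process that already has the old file open keeps reading the old contents, while new opens get the complete new copy.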

pochedls commented 4 years ago

@painter1 - I appreciate your fast response. So I didn't find any datasets with the status retracted - but there were many files with this status (I killed my query).

Our xml files are granular to the path level (e.g., we run cdscan -x xmlname.xml /path/to/data/*.nc); we don't care that much about individual files. How can I best figure out which xml files should be removed due to data retractions?

Would I run a query like this:

SELECT * FROM file WHERE status='retracted';

and then parse every file to subset the unique local_path values?

painter1 commented 4 years ago

SELECT * FROM file WHERE status LIKE '%retracted';
SELECT * FROM dataset WHERE status LIKE '%retracted';

In the working copy of the database at this moment, there are 2,685,804 files and 112,113 datasets matching the above criteria. Your sdt6.db dates from last night, so your numbers today would be a little less.

I'm not totally confident that the dataset and file lists match as well as they should. But I am totally confident that everything marked as 'retracted' really is retracted.

Both the file and dataset tables have a column 'local_path'. It will start with CMIP6/ and end with the filename for files, or the version for datasets. If you organize things by path, you might want to look at that.
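A rough sketch of how the retracted paths could be pulled out with Python's `sqlite3` module. The table names (`file`, `dataset`) and columns (`status`, `local_path`) come from the discussion above; the function names, the toy schema, and the `CMIP6/demo/...` paths are made up for illustration, and the real sdt6.db has more columns.

```python
import posixpath
import sqlite3


def retracted_dataset_paths(conn):
    """Return sorted, unique local_path values of retracted datasets."""
    # '%retracted%' matches any status that contains the string 'retracted'.
    rows = conn.execute(
        "SELECT DISTINCT local_path FROM dataset WHERE status LIKE '%retracted%'"
    )
    return sorted(row[0] for row in rows)


def retracted_file_dirs(conn):
    """Return sorted, unique parent directories of retracted files."""
    rows = conn.execute(
        "SELECT local_path FROM file WHERE status LIKE '%retracted%'"
    )
    # File paths end with the filename, so strip it to get the directory.
    return sorted({posixpath.dirname(row[0]) for row in rows})


# Toy in-memory database standing in for sdt6.db (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (local_path TEXT, status TEXT);
CREATE TABLE file (local_path TEXT, status TEXT);
INSERT INTO dataset VALUES ('CMIP6/demo/dataset_a/v1', 'published,retracted');
INSERT INTO dataset VALUES ('CMIP6/demo/dataset_b/v1', 'published');
INSERT INTO file VALUES ('CMIP6/demo/dataset_a/v1/x.nc', 'done,retracted');
INSERT INTO file VALUES ('CMIP6/demo/dataset_a/v1/y.nc', 'retracted');
INSERT INTO file VALUES ('CMIP6/demo/dataset_b/v1/z.nc', 'done');
""")
print(retracted_dataset_paths(conn))
print(retracted_file_dirs(conn))
```

Against the real database, the connection could be opened read-only with `sqlite3.connect("file:/path/to/sdt6.db?mode=ro", uri=True)`, where the path is a placeholder. Since the dataset and file lists may not match perfectly, comparing the two results is a cheap consistency check.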

pochedls commented 4 years ago

I created a retract script, which looks at @painter1's database and finds retracted datasets. It then ignores any matching datasets in the xagg database, gives them error=retracted, and deletes any xml files that were generated. It removed around 24k files.

Need to incorporate this into nightly job before resolving the issue.

durack1 commented 4 years ago

@pochedls a problem with this is that an analysis that has been published that used the now retracted data will not be reproducible. My preference would be to update the filename of the xml (using the 000000.xml identifiers) and keep the xml around, and only purge this if the data itself was purged.

pochedls commented 4 years ago

This is kind of a complicated, rare situation, and I already purged the xmls that correspond to retracted data (as outlined above). In the future, I'll move the retracted xmls to a retract directory in case people need to access them. If I find a file is retracted before a scan, my plan is to never scan it.

durack1 commented 4 years ago

When I eventually get back into my env, I'll not be surprised if some of the data that I generated for the AR6 SOD is no longer available, I suppose it's not published, but not the most ideal situation.

Absolutely no need for any action in response to my query, but it would be great to think a little about this and consequences before implementing future changes

pochedls commented 4 years ago

@durack1 - why do you want to use data that was explicitly retracted?

durack1 commented 4 years ago

Simple really, I want to be able to recreate a past analysis. In the case that there was a query, I would want to be able to understand what the implications were for (at the time) erroneous data being included that were later identified as problematic and unpublished. In the case that such data no longer exists, you can't do any detective work to ascertain how such errors impacted your analysis

pochedls commented 4 years ago

@durack1 - I will regenerate the xml files that correspond to retracted data over the weekend. The updated script will move these files to /p/user_pub/xclim/retracted/.
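The move step could look roughly like the sketch below. This is not the actual xagg retract script: the function name `move_to_retracted` is hypothetical, and it assumes xml filenames are unique so that nothing in the target directory gets overwritten.

```python
import os
import shutil


def move_to_retracted(xml_paths, retracted_dir):
    """Move each existing xml file into retracted_dir; return the new paths."""
    os.makedirs(retracted_dir, exist_ok=True)
    moved = []
    for path in xml_paths:
        if not os.path.isfile(path):
            continue  # no xml file exists at this path, nothing to move
        target = os.path.join(retracted_dir, os.path.basename(path))
        shutil.move(path, target)
        moved.append(target)
    return moved
```

It would be called with the list of xml files tied to retracted datasets and the retracted directory, for example `move_to_retracted(paths, "/p/user_pub/xclim/retracted/")`.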

durack1 commented 4 years ago

Sorry we didn't chat about this before the 24k were killed/deleted, I hope this isn't much work, if it is then it's not a priority at all

pochedls commented 4 years ago

These files are now in the retracted directory. Ticket is still open because I need to set up a cron job to automate this process.

pochedls commented 4 years ago

This should resolve this ticket: https://github.com/pochedls/xagg/commit/52f488480718af1eecc258718b457cf2180b46b4. Will check tomorrow to make sure everything worked as expected.