hydroshare / hydroshare

HydroShare is a collaborative website for better access to data and models in the hydrologic sciences.
https://www.hydroshare.org
BSD 3-Clause "New" or "Revised" License

Files are not synchronized between Django and iRODS on beta and other dev environments #2031

Closed alvacouch closed 7 years ago

alvacouch commented 7 years ago

A quick check of the 1.10.0 beta.hydroshare.org shows that iRODS and Django disagree on the names of some files. This may be snapshot skew, but we don't have the tools to check that.

One temporary solution is to write a file validator that checks whether iRODS and Django are synchronized. This can be utilized to track down the causes of any mis-synchronization.

pkdash commented 7 years ago

@alvacouch What about www? Can you do the same check there?

alvacouch commented 7 years ago

WWW is much harder to test because it's prior to the ResourceFile normalization. Possible but requires me to run the migration (at least in spirit) in order to perform the test. Lots more work because the migration is run in a different syntax than production. I would need to port it.

Is this necessary?

pkdash commented 7 years ago

@alvacouch Don't worry about it then.

alvacouch commented 7 years ago

@pkdash Will look into checking www after I finalize this version. Would that be a branch off of master?

pkdash commented 7 years ago

@alvacouch If you want this validation check to run on www prior to 1.10 release then the branch needs to be off of 'master' so that @mjstealey can apply it as a hotfix to master. However, I think in this particular case of hotfix the master should not be merged back to 'develop' since the develop would have the real validator based on the new file api.

alvacouch commented 7 years ago

Here are the results of the first runs of the validation script:

OUTPUT-develop.txt (EDIT: updated for new script) OUTPUT-www.txt

These do not compensate for misplaced hydroshareUserProxy resources. EDIT: These are fixed in the new script for develop

dtarb commented 7 years ago

I checked several of the "problem" resources in OUTPUT-www.txt and found that quite a few are not a problem.

It would be good to avoid reporting these so we can see the ones that are problems more easily.

The concerning errors I found (in spot checking) are:
ERROR: no valid file name defined for 6b165657dc1f437a873c58d07c3d97a9 (NetcdfResource)
check_irods_files: django file None in folder None, resolved to None, does not exist in iRODS
ERROR: no valid file name defined for 6b165657dc1f437a873c58d07c3d97a9 (NetcdfResource)
check_irods_files: django file None in folder None, resolved to None, does not exist in iRODS
check_irods_files: listing of iRODS directory 6b165657dc1f437a873c58d07c3d97a9/data/contents failed
check_irods_files: affected resource 6b165657dc1f437a873c58d07c3d97a9 type is NetcdfResource, title is 'This resource can't be deleted'

This is a resource that @gantian127 owns and she has flagged as unable to delete. I suggest we just delete this as admin.

Some resources are indicated to be federated; these appear to have been created from iRODS because they are bigger than 1 GB. Some of these are iUtah resources, so we should work with @horsburgh and @AmberSJones to diagnose.
WARNING: federated file name or path declared for unfederated resource 094da7d9400f493fb1e412df015e17a4 (GenericResource): data/contents/annual-reports-by-year.zip
INFO: data/contents/ stripped from fed name or path: annual-reports-by-year.zip for 094da7d9400f493fb1e412df015e17a4 (GenericResource)
check_irods_files: file 094da7d9400f493fb1e412df015e17a4/data/contents/2009-and-later-annual-reports-coded (3).csv in iRODs does not exist in Django
check_irods_files: file 094da7d9400f493fb1e412df015e17a4/data/contents/2009-and-later-variable-codebook (3).csv in iRODs does not exist in Django

For the below the UI indicates 4 files suffixed 2013-2016. If the bag is downloaded it holds 4 files. So the file iUTAH_GAMUT_RB_TM_C_RawData_2017.csv does not appear to be in the system and the user does not see any error.
ERROR: existing path aff4e6dfc09a4070ac15a6ec0741fd02/data/contents/iUTAH_GAMUT_RB_TM_C_RawData_2017.csv is not conformant for aff4e6dfc09a4070ac15a6ec0741fd02 (GenericResource)
ERROR: no valid file name defined for aff4e6dfc09a4070ac15a6ec0741fd02 (GenericResource)
check_irods_files: django file aff4e6dfc09a4070ac15a6ec0741fd02/data/contents/iUTAH_GAMUT_RB_TM_C_RawData_2017.csv in folder None, resolved to None, does not exist in iRODS
check_irods_files: affected resource aff4e6dfc09a4070ac15a6ec0741fd02 type is GenericResource, title is 'iUTAH GAMUT Network Raw Data at Todd's Meadow Climate Site (RB_TM_C)'

alvacouch commented 7 years ago

@pkdash @hyi @aphelionz Just a quick note. I remember now that in revising ResourceFile, I took some pains to make sure that ResourceFile.delete works properly now. It didn't work properly before. This could explain a lot of the garbage I am finding as unreferenced iRODS files, especially in the directories with a lot of churn. In other words, it might be a bug that has already been squashed.

hyi commented 7 years ago

@alvacouch Great. Then I think those unreferenced iRODS files can be safely deleted to get us to a cleaner baseline. Perhaps you can provide a switch in your management command that lets an admin delete these unreferenced iRODS files if they decide it is safe to do so.
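
A minimal sketch of what such a switch could look like, written with plain argparse (Django management commands declare their options through the same argparse interface). The flag name `--delete-orphans`, the function names, and the `remove` callable are all hypothetical; the actual check_irods_files command is not shown in this thread. By default the sketch only reports, and it deletes only when the switch is given.

```python
import argparse


def build_parser():
    # Hypothetical option set; only the opt-in delete switch is sketched here.
    parser = argparse.ArgumentParser(description="Check iRODS/Django file sync (sketch)")
    parser.add_argument(
        "--delete-orphans",
        action="store_true",
        help="delete iRODS files that no Django record references",
    )
    return parser


def handle_orphans(orphans, delete_orphans, remove=None, log=print):
    """Report orphaned paths; delete them only when the switch is set.

    `remove` stands in for whatever call removes a file from iRODS
    (an assumption, not a real HydroShare function).
    Returns the list of paths actually deleted.
    """
    deleted = []
    for path in sorted(orphans):
        if delete_orphans and remove is not None:
            remove(path)
            deleted.append(path)
            log("deleted orphan: %s" % path)
        else:
            log("orphan (not deleted): %s" % path)
    return deleted
```

Keeping deletion opt-in means a routine validation run stays read-only, which matches the idea that an admin decides when it is safe.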

alvacouch commented 7 years ago

@hyi @pkdash @dtarb @mjstealey A new output from revised check_irods_files has been uploaded to the google drive.

Highlights:

Remaining problems:

| guid | type | title | problem |
| --- | --- | --- | --- |
| 094da7d9400f493fb1e412df015e17a4 | GenericResource | Utah Municipalities Stormwater Annual Reports | files not in Django |
| aff4e6dfc09a4070ac15a6ec0741fd02 | GenericResource | iUTAH GAMUT Network Raw Data at Todds Meadow Climate Site (RB_TM_C) | files not in iRODS |
| 5e80dd7cbaf04a5e98d850609c7e534b | GenericResource | iUTAH GAMUT Network Raw Data at Knowlton Fork Climate Site (RB_KF_C) | files not in iRODS |
| 325b21d55b2c49658a91944fabd896cf | GenericResource | iUTAH GAMUT Network Raw Data at the Green Infrastructure Climate Site (RB_GIRF_C) | files not in iRODS |
| 9e5e99125d1646c69dde9fc43e137667 | GenericResource | iUTAH GAMUT Network Raw Data at Fort Douglas Storm Drain (RB_FortD_SD) | files not in iRODS |
| 3ebf244bd2084cfaa68b83b7f91e9587 | GenericResource | iUTAH GAMUT Network Raw Data at Trial Lake Climate Site (PR_TL_C) | files not in iRODS |
| 6aa75450ee2744cdb34ed8dde929a84a | GenericResource | iUTAH GAMUT Network Raw Data at Charleston Climate Site (PR_CH_C) | files not in iRODS |
| 887180409e4545018c8372f0bd6f8ff3 | GenericResource | iUTAH GAMUT Network Raw Data at Provo River near Charleston Advanced Aquatic Site (PR_CH_AA) | files not in iRODS |
| 4a5bb3a976004f0ea63991323335b170 | GenericResource | iUTAH GAMUT Network Raw Data at Beaver Divide Climate Site (PR_BD_C) | files not in iRODS |
| ecb77926c2484e068f28acda434f8772 | GenericResource | iUTAH GAMUT Network Raw Data at Logan River near the Water Lab Advanced Aquatic Site (LR_WaterLab_AA) | files not in iRODS |
| 40655b4fc21142d090a5a4b835c14220 | GenericResource | iUTAH GAMUT Network Raw Data at Red Butte Gate Basic Aquatic Site (RB_RBG_BA) | files not in iRODS |
| 1846b79a648a4088aad987cc7241656f | GenericResource | iUTAH GAMUT Network Raw Data at Red Butte Creek near 900 W (1300 South) Basic Aquatic Site (RB_900W_BA) | files not in iRODS |
| 94bcad20fbfb4c44ac7f98a0fdfa5e79 | GenericResource | iUTAH GAMUT Network Raw Data at Blacksmith Fork above confluence with Logan River (BSF_CONF_BA) | files not in iRODS |
| a22bbdfb431c44a68959534c94e96392 | GenericResource | iUTAH GAMUT Network Raw Data near Connor Road Storm Drain Site (RB_CR_SD) | files not in iRODS |
| bc16655330b64bcaa366d464b00e45f0 | GenericResource | iUTAH GAMUT Network Raw Data near Dentistry Building Storm Drain (RB_Dent_SD) | files not in iRODS |
| 2e9db97be020401c9aa03017cb7ee505 | GenericResource | iUTAH GAMUT Network Raw Data near Green Infrastructure Storm Drain (RB_GIRF_SD) | files not in iRODS |
| c3ecee31a0c64490bf6a2fcb4841cee4 | GenericResource | iUTAH GAMUT Network Raw Data at Red Butte Creek at 1300 East Aquatic (RB_1300E_A) | files not in iRODS |
| a56608d8948c43fdb302e1438cf09169 | GenericResource | iUTAH GAMUT Network Raw Data at Lower Knowlton Fork Aquatic (RB_LKF_A) | files not in iRODS |
| 9700b80f5cfa42d4a52c9aaab81a4e11 | CollectionResource | Freshwaterhack project: Comparing spatial datasets | files not in iRODS |
| cde532b5d39141db9c2b22122774afae | GenericResource | iUTAH GAMUT Network Quality Control Level 1 Data at Knowlton Fork Climate (RB_KF_C) | files not in iRODS |
| 86a27290e1b443a488f0b84cb9e2af91 | GenericResource | iUTAH GAMUT Network Quality Control Level 1 Data at Climate Station at Logan River Golf Course (LR_GC_C) | files not in iRODS |
| bb41efc853134d0a90fa1da0041367f5 | GenericResource | iUTAH GAMUT Network Quality Control Level 1 Data at Lower Knowlton Fork Aquatic (RB_LKF_A) | files not in iRODS |
| 200a03e04591410f8b6310b43558634b | GenericResource | iUTAH GAMUT Network Quality Control Level 1 Data at Climate Station at TW Daniels Experimental Forest (LR_TWDEF_C) | files not in iRODS |
| f83c4a6ddaec4085bd152dd261a1a89c | GenericResource | iUTAH GAMUT Network Quality Control Level 1 Data at Above Red Butte Reservoir Advanced Aquatic (RB_ARBR_AA) | files not in iRODS |
| 878093a81b284ac8a4f65948b1c597a2 | GenericResource | iUTAH GAMUT Network Raw Data at USGS Gage 10172200 above Red Butte Reservoir (RB_ARBR_USGS) | files not in iRODS |
| b5f0873404b941ef982df72e90fc140c | GenericResource | iUTAH GAMUT Network Raw Data at Provo River at Charleston Central Utah Water Conservancy District Gage (PR_CH_CUWCD) | files not in iRODS |
| 7f0392828f01467386102ae4b52c3b5a | NetcdfResource | Spatial-temporal statistics of monthly soil moisture data from the NLDAS model (1979-2013) | files not in Django |
| fef58369046c4a64a2d7564c4e7e1fd0 | NetcdfResource | Spatial-temporal statistics of monthly evapotranspiration data from the NLDAS model (1979-2013) | missing from both Django and iRODS |
| fc00c8eaa0944a4a98ea2ddbfe54320e | NetcdfResource | Spatial-temporal statistics of monthly precipitation data from the NLDAS model (1979-2013) | files not in Django |
| ba64d962eb6c460abc9a8628946df116 | NetcdfResource | Spatial-temporal statistics of monthly temperature data from the NLDAS model (1979-2013) | files not in Django |
| 3f354dd111f24998b37099ebdf478441 | NetcdfResource | Spatial-temporal statistics of monthly surface runoff data from the NLDAS model (1979-2013) | files not in Django |
| c9fb977bae21432b8b202f13b62285b1 | NetcdfResource | Spatial-temporal statistics of daily soil moisture data from the NLDAS model (1979-2013) | files missing from Django and iRODS |
| f42f1387d7d54d7a9228888381d7c30e | NetcdfResource | Spatial-temporal statistics of daily precipitation data from the NLDAS model (1979-2013) | files missing from Django and iRODS |
| ff2e648104254ee4bcf8db925170ea91 | NetcdfResource | Spatial-temporal statistics of daily temperature data from the NLDAS model (1979-2013) | files missing from Django and iRODS |
| fbc7af608a324a7a9cbbdd415d0a9499 | NetcdfResource | Spatial-temporal statistics of daily surface runoff data from the NLDAS model (1979-2013) | files missing from Django and iRODS |
| fe6edc72a982454b8a86aacd7cfbaf74 | GenericResource | Test | files missing from Django and iRODS |
| 38a881fd35af49448b483f0343ca60e5 | CollectionResource | Freshwaterhack project: Comparing spatial datasets | files not in iRODS |
| faacb77f1a8144c4a232edd8ffdd179b | MODFLOWModelInstanceResource | Metadata | files missing from Django and iRODS |
| 0183ec4000f644fa9378cf28cfe5c2e2 | GenericResource | Sauk River Basin Observatory | files missing from Django and iRODS |
| e81b1fb3cb5a49538d6c2ad3077b7b71 | GenericResource | Test - REMOVE ME | files missing from Django and iRODS |
| cdc6292fbee24dfd9810da7696a40dcf | CompositeResource | Comparison of hydrodynamic and low-complexity flood modeling tools | files missing from Django and iRODS |
| 3f7680cf83dc426e858d5b48cb95a565 | GenericResource | Green Infrastructure Designer with RHESSys Workflow | files missing from Django and iRODS |
| dfae7f297db749ccac0c85a7bef56582 | GenericResource | Bayou Fountain | iRODS resource missing |
| 098bcea9945f4a00ba0be5a84096aa19 | GenericResource | Bayou Fountain | iRODS resource missing |
| 65593e64416b4fa2a6d58971546c9713 | GenericResource | Bayou Fountain | iRODS resource missing |
| de42a9f014c344578d96c1717a520786 | GeographicFeatureResource | cTurnipseed_homewatershed | iRODS resource missing |
| 95b75ee546b2479c80e1895f95f6d2a1 | GenericResource | Chiamaka.Oyekwe-Madumelu_WaterShed | iRODS resource missing |
| 26b015134eb541b1a1c6587b71cd3fc8 | GenericResource | Fort Bend Hand Practice | iRODS resource missing |
| e46ea1e2c3f24d5ba2f64ce356e241ce | ModelProgramResource | My new collection | iRODS resource missing |
| 06a765609dc74a5090290ef34682f4ba | ModelInstanceResource | ADCIRC - serial - testcase | iRODS resource missing |
| d0cbc743fc8e4a16bce2a6377a182e41 | CompositeResource | HydroShare Overview: Managing and Sharing Research Data Using HydroShare | iRODS resource missing |
| 67607a3752514947b7eaa92a0ce6ef5f | GenericResource | Onion Creek | iRODS resource missing |
| 9990bc18925b429aae35c142bea235da | NetcdfResource | UEB model simulation of snow water equivalent in Logan River watershed from 2008 to 2009 | iRODS resource missing |
| 438d578db3e2426cb1a14a939d0b36f0 | GenericResource | Hello From JupyterHub | iRODS resource missing |
| 946d1e62ed4c457db15f679e6bedc258 | GenericResource | Hello From JupyterHub | iRODS resource missing |
| a2d59cd6d696401e90c159bc965a3ca9 | CollectionResource | Presentations about HydroShare | iRODS resource missing |
| 26d30f89e31f481bb63a4e089dfa1340 | CompositeResource | HydroShare and Model Sharing: Presentation to IWRSS Model Registry Team, Nov 8, 2016 | iRODS resource missing |
| 23c05d3177654a9ab9dc9023d00d16ed | CompositeResource | Supporting files for python tool subset_nwm_netcdf 1.1.4 | iRODS resource missing |
| 2773de0c379d4df4bed0b301b4525382 | GenericResource | Sentinel-2 Spectral Response | iRODS resource missing |
| 271d64a09da9460c919603b7bd5e9b29 | CompositeResource | A Subset of NWM Ver1.1 20170419 results for TwoMileCreek watershed at Tuscaloosa, Alabama | iRODS resource missing |
| fa3c6b47370e4367b5c71d36def1d4f4 | GenericResource | IDEAS for GI | iRODS resource missing |
| 7feec694d0b140b5991ce20135c1dcef | GenericResource | DeadRun Discharge Observation Data | iRODS resource missing |
| 2e295531907844b985d5c1b95bf65420 | GenericResource | BoxElderCounty | iRODS resource missing |
| b43e212dc4af45f0958bee1e94f6949e | GenericResource | DeadRun RHESSys model results | iRODS resource missing |

Scroll to right in table to see what's wrong. The whole table didn't fit. Please suggest actions to take.

Key to the above list:

| notation | meaning |
| --- | --- |
| iRODS resource missing | The whole resource tree (starting with short_id) is not present in iRODS |
| files not in iRODS | There are files present in Django ResourceFiles that are not in iRODS |
| files not in Django | There are files present in iRODS that are missing from Django |
| files missing from Django and iRODS | There are files in Django that are not in iRODS, and other files that are in iRODS but not in Django |

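
The four notations in the key amount to a set comparison per resource. The following is only a sketch of that logic, not the actual check_irods_files code; the function name and the convention of passing `None` for an absent resource tree are assumptions made for illustration.

```python
def classify_resource(django_files, irods_files):
    """Return the report notation for one resource.

    django_files: iterable of file paths recorded in Django.
    irods_files: iterable of file paths found in iRODS, or None when the
                 whole resource tree is absent (hypothetical convention).
    """
    if irods_files is None:
        return "iRODS resource missing"
    django_set, irods_set = set(django_files), set(irods_files)
    only_django = django_set - irods_set  # recorded in Django, absent from iRODS
    only_irods = irods_set - django_set   # present in iRODS, unknown to Django
    if only_django and only_irods:
        return "files missing from Django and iRODS"
    if only_django:
        return "files not in iRODS"
    if only_irods:
        return "files not in Django"
    return "synchronized"
```

For example, a resource whose Django record lists only annual-reports-by-year.zip while iRODS also holds an extra CSV would come out as "files not in Django".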
alvacouch commented 7 years ago

@dtarb @hyi @pkdash @mjstealey @horsburgh

Jeff, please weigh in on the above list. Which of the above resources are yours that were being manipulated by the REST API?

After sleeping on this issue, I have the following recommendations:

| condition | likely cause | recommendations |
| --- | --- | --- |
| File not present in Django | failed ResourceFile.delete in REST API leaves files behind -- fixed in 1.10.0 | delete file in iRODS using cleanup script. |
| File not present in iRODS | reason unknown; best (unverified) guess is that the uploaded file was too large | research individually, make it possible to delete these in the REST API. |
| Resource tree not present | failed delete_resource is best theory | delete resource in Django |

Comments:

  1. During the ResourceFile fix, I discovered that the ResourceFile.delete (as used in the REST API) was leaving iRODS files behind. This is likely the cause of "files not in Django". Recommendation is to automagically clean these up without notifying anyone. User would try to delete the file in the REST API and the file would still be there. This is the only way this could have happened, AFAIK.
  2. Short of botches in transcription, the only way the code can delete a resource entirely is through delete_resource. This routine deletes the iRODS side first, and then deletes the Django side. My working theory is that the code crashed after deleting from iRODS. Cleanup is to delete the resource from Django, which may require some adjustment because the resource files are already deleted.

This course of action requires three hotfixes (because they have to run on www):

  1. Add deleting orphaned files in iRODs to the cleanup scripts. Easy.
  2. Make it possible to delete a resource from Django that has already been deleted from iRODs. This requires catching the exceptions that will be thrown when the resource has already been deleted from iRODs.
  3. Make it possible to delete a ResourceFile from Django that is already not present in iRODs. Same annotation as above.
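
Hotfixes 2 and 3 share one pattern: treat "already gone from iRODS" as success and carry on with the Django delete. A minimal sketch of that pattern follows. The exception class and both callables are hypothetical stand-ins; the real exception raised by the storage layer is not named in this thread.

```python
class StorageObjectMissing(Exception):
    """Hypothetical error raised when a path is not present in iRODS."""


def delete_tolerating_missing(delete_from_storage, delete_from_django, log=print):
    """Delete the storage side first, then the Django side.

    A missing storage object is treated as already deleted, so the Django
    record is still removed. Any other exception propagates unchanged.
    Returns True if the storage object existed, False if it was already gone.
    """
    existed = True
    try:
        delete_from_storage()
    except StorageObjectMissing:
        existed = False
        log("storage object already absent; removing Django record only")
    delete_from_django()
    return existed
```

Catching only the "missing" error is deliberate: a permission or connection failure should still abort before the Django record is removed.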

Your thoughts?

alvacouch commented 7 years ago

There is one case in which a pair of errors ('missing in iRODs' and 'missing in Django') is rather obviously the result of a botched move_or_rename. I think I fixed this in 1.10.0.

In the other cases where there is an object in Django that is not in iRODs, provenance is not so clear. Translation: I might not have fixed whatever did this in 1.10.0. Best theory so far is that there was an attempt to upload something too large.

There was a general problem on www that -- since iRODs paths to files were manipulated in the application rather than the ResourceFile API -- the paths tended to get messed up by application programmers, which means that when one wants to delete the ResourceFile in iRODS, the path is incorrect and the delete fails in iRODs. I moved path handling to the ResourceFile API and this problem is gone in 1.10.0. This probably accounts for all "missing in Django" errors.
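
To illustrate what moving path handling into one place buys, here is a sketch of a single helper that builds the storage path, so callers never assemble it by hand. It assumes the `<short_id>/data/contents/<name>` layout seen in the validator log earlier in this thread and ignores federated resources; the function name is hypothetical and this is not the actual ResourceFile code.

```python
CONTENT_PREFIX = "data/contents/"


def storage_path(short_id, name):
    """Build '<short_id>/data/contents/<relative name>' from a loose input.

    Accepts a bare file name, a name already prefixed with data/contents/,
    or a full path starting with the resource's short_id.
    """
    rel = name.lstrip("/")
    if rel.startswith(short_id + "/"):
        rel = rel[len(short_id) + 1:]
    if rel.startswith(CONTENT_PREFIX):
        rel = rel[len(CONTENT_PREFIX):]
    if not rel:
        raise ValueError("no file name left after stripping prefixes")
    return "%s/%s%s" % (short_id, CONTENT_PREFIX, rel)
```

With one function owning the layout, the three input spellings all resolve to the same path, so a delete cannot miss because a caller built the path differently.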

AmberSJones commented 7 years ago

@alvacouch , the resources that our group (with @horsburgh ) created/modified with the API are all those with the naming convention "iUTAH GAMUT Network ..."

pkdash commented 7 years ago

@alvacouch

During the ResourceFile fix, I discovered that the ResourceFile.delete (as used in the REST API) was leaving iRODS files behind. This is likely the cause of "files not in Django".

The internal API (hydroshare.delete_resource_file()) is used both by the REST API and by the view function used by the UI. So I am not sure why the files would fail to be deleted from iRODS only in the REST API case.

From the iUTAH workflow description that I got from Kenny, it seems that his script tries to delete a file only if that file is reported to be in Django since the REST api for listing files for a resource generates the list from Django.

alvacouch commented 7 years ago

@pkdash The REST API is simply where I saw the error during testing. It is not necessarily the only expression of that error. But I saw it there and the tests for the REST endpoints are the ones that validate correct function.

I did not go back and put extra tests into hydroshare.delete_resource_file (which, if we were adhering to best practices for naming, would be named ResourceFile.delete instead of what is there now, which is a policy-free delete that bypasses those rules).

alvacouch commented 7 years ago

@AmberSJones @horsburgh May we have permission to clean up your resources by deleting files that are only present on one side (in Django or in iRODs but not both)? See detailed log for the identities of these files (in link before long listing).

alvacouch commented 7 years ago

@AmberSJones @Horsburgh Actually, the errors in your resources are a bit of a mystery to me. Your iRODs files are missing. I don't know how that could happen. Would it be possible to try and upload and see if you can reproduce such errors? I wonder why they're happening.

horsburgh commented 7 years ago

@alvacouch - yes, go ahead and clean any of the ones that start with "iUTAH GAMUT". These are all being automatically generated by Kenny's script, so even if we had to regenerate them, we could. There's a couple of other iUTAH-related ones in your list, but we may want to look at those more closely.

horsburgh commented 7 years ago

@alvacouch - and yes, Kenny is working right now on the code that generates/updates the files in these resources. I'm hoping he figures out the issue he's got there soon so he can turn it back on and continue updating the files.

alvacouch commented 7 years ago

@ChristinaB Could you list the bags that won't download here? Do any of these correspond to the resources above (that are very, very broken)?

alvacouch commented 7 years ago

@ChristinaB I actually think that the things you're seeing might be different. I will start a new issue.

ChristinaB commented 7 years ago

Starter list of resources (Tony will work on code): when I click on 'Download All Content as Zipped Bagit Archive', these fail with the error "Please wait for the resource bag to be created.....", and the download never progresses or completes.

dtarb commented 7 years ago

I generally agree with the approaches suggested above to resolve problems.

For files with a record entry in Django, but not in iRODS, there is no option for recovery, so we should just delete the file record entry in Django. In the tests I have done I have not found an error on the UI (not to say one is not there; I just did not find it). I have not found resources apart from the ones from Christina above where I am unable to download a file that appears on the UI.

For files in iRODS but with no corresponding resource in Django we should archive them somewhere then delete them from the system. Working from the archive we should see if resourcemetadata.xml can tell us who the owner/creators were and the nature of the files. If I can get a list I can make a judgment call as to whether we need to try to contact the user. It is likely that these are deletes that failed so we will not have to do anything.

For files in iRODS with a resource in Django, but where the Django and iRODS file listings are different, we should set the Django listings to be consistent with iRODS and regenerate bag and metadata files, or set the flag to false so these are regenerated on demand. I think that resources with “files not in Django” and “files missing from Django and iRODS” are in this category, as checking a few of these on the UI, files do appear on the UI and are downloadable (at least for the few I checked).

I was able to go to resources listed as “iRODs resource missing” and download files and the bag. So I am not sure what this error indicates.

alvacouch commented 7 years ago

@dtarb @pkdash @hyi @mjstealey I am closing this issue with some final comments.

Most of the problems above were synchronization problems, solved by carefully synchronizing dumps of Django and iRODs. The remaining (real) problems were fixed on a one-off basis in PR #2090. There remains some concern that the UI is corrupting Christina's resources due to the complexity of what she asks the UI to do. See issue #2095 and PR #2100 for a beginning to debugging that.

It is not theoretically possible to synchronize the beta environment with production on federated resources. The reason for this is that federated resources cannot be "copied" to the test environment; they're too large. Thus the production and beta environments continue to modify the same federated resources asynchronously and they will get out of sync when that happens. Thus, tests of whether federated resources are synchronized on beta are not feasible. The mechanism in PR #2100 can be used to test whether they are corrupted in production.

alvacouch commented 7 years ago

@Castronova @ChristinaB See issue #2056 for remaining issue concerning bag download (which doesn't seem to have much to do with this issue).