ncihtan / hdash_air

MIT License

clear hdash file cache and rerun from scratch? #18

Closed alexeliotlash closed 6 months ago

alexeliotlash commented 7 months ago

Some file error alerts seem to persist even after the centers have corrected them. For example, on https://hdash.website-us-east-1.linodeobjects.com/HTA10.html there are 398 validation errors. There are multiple "links connect" errors referring to ID=JP-TRANS-2.

Looking at one of those errors, "HTA10_0000_06191 references parent ID=JP-TRANS-2, but no such ID exists. [Error occurred while processing file: syn51703772 of type MassSpectrometryLevel1].", when I freshly download syn51703772 from Synapse I can't find any references to JP-TRANS-2.

We thought perhaps hdash is caching files somewhere locally and this cache might need to be cleared before hdash could be run from scratch?

ecerami commented 7 months ago

I don't yet have an answer on this, but I also downloaded syn51703772, and I also couldn't find JP-TRANS-2. So, at least we see the same thing so far :-)

ecerami commented 7 months ago

Ok... I have now confirmed that hdash has a meta_cache table, and I see 5 records for syn51703772. So, it could be a caching issue.

ecerami commented 7 months ago

Hi Alex, this is actually not a caching issue.

Rather, we have two files: syn39282351 and syn51703772.

Both files use the same primary HTAN IDs, e.g. both files claim to be annotating HTA10_0000_06037. This is not allowed, and it is causing an error in the validation message, e.g.

HTA10_0000_06037 references parent ID=JP_Desc_3, but no such ID exists. [Error occurred while processing file: syn51703772 of type MassSpectrometryLevel1].

Because HTA10_0000_06037 is defined as the primary key in both files, hdash cannot distinguish between the two files. But if you look at syn39282351, you will see that HTA10_0000_06037 does reference parent ID=JP_Desc_3.

The error message should probably say:

HTA10_0000_06037 references parent ID=JP_Desc_3, but no such ID exists. [Error occurred while processing file: syn39282351 of type OtherAssay].

To fix this issue, I think we have to fix the meta files themselves.
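To make the collision above concrete, here is a minimal Python sketch of how a shared primary ID can cause a parent reference from one file to be reported against another. This is not hdash's actual code: `build_index`, `check_links`, the tuple layout, and the parent ID `HTA10_0000_PARENT` are all hypothetical, and the sketch simply assumes the index is keyed on primary ID with the last file winning.

```python
from collections import defaultdict


def build_index(files):
    """Index rows by primary ID (hypothetical layout, not hdash's real schema).

    files: list of (synapse_id, category, rows), where each row is
    (primary_id, parent_id).
    """
    source_of = {}               # primary_id -> (synapse_id, category); last file wins
    parents = defaultdict(set)   # primary_id -> parent IDs seen in any file
    for synapse_id, category, rows in files:
        for primary_id, parent_id in rows:
            source_of[primary_id] = (synapse_id, category)
            parents[primary_id].add(parent_id)
    return source_of, parents


def check_links(source_of, parents, known_ids):
    """Report parent IDs that do not exist, attributed via source_of."""
    errors = []
    for primary_id, parent_ids in parents.items():
        for parent_id in sorted(parent_ids):
            if parent_id not in known_ids:
                synapse_id, category = source_of[primary_id]
                errors.append(
                    f"{primary_id} references parent ID={parent_id}, but no such "
                    f"ID exists. [Error occurred while processing file: "
                    f"{synapse_id} of type {category}]."
                )
    return errors


# Both files claim the same primary ID; only the first references JP_Desc_3.
files = [
    ("syn39282351", "OtherAssay", [("HTA10_0000_06037", "JP_Desc_3")]),
    # HTA10_0000_PARENT is a made-up, valid parent used only for illustration.
    ("syn51703772", "MassSpectrometryLevel1",
     [("HTA10_0000_06037", "HTA10_0000_PARENT")]),
]
source_of, parents = build_index(files)
errors = check_links(source_of, parents, known_ids={"HTA10_0000_PARENT"})
for message in errors:
    print(message)
```

Under these assumptions the bad reference to JP_Desc_3 comes from syn39282351, yet the message names syn51703772, because the primary ID alone cannot tell the two files apart.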

ecerami commented 7 months ago

This might be opening a can of worms, but I have now added a duplicate Primary ID check.

See: https://hdash.website-us-east-1.linodeobjects.com/HTA10.html

Good news is that we now have the root problem:

Primary ID HTA10_0000_06191 has already been defined in OtherAssay. [Error occurred while processing file: syn51703772 of type MassSpectrometryLevel1].

Bad news is that Stanford has lots of duplicate primary IDs.

alexeliotlash commented 7 months ago

To the error message of the duplicate Primary ID check, could you add the filename of the file you're processing? For example, from "Primary ID HTA10_07_00102001 has already been defined in BulkRNA-seqLevel1. [Error occurred while processing file: syn39282161 of type BulkRNA-seqLevel1]." to something like "Primary ID HTA10_07_00102001 has already been defined in file synXXXXXXXX of type BulkRNA-seqLevel1. [Error occurred while processing file: syn39282161 of type BulkRNA-seqLevel1]." ?
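A duplicate Primary ID check that reports the earlier file as well as the current one could look roughly like the following Python sketch. It is illustrative only and not taken from hdash: `find_duplicate_primary_ids` and the input layout are hypothetical, and the example input assumes HTA10_0000_06191 appears in both syn39282351 and syn51703772.

```python
def find_duplicate_primary_ids(files):
    """Flag primary IDs that were already defined by an earlier file.

    files: list of (synapse_id, category, primary_ids). This layout is a
    hypothetical stand-in for however the metadata files are actually loaded.
    """
    first_seen = {}  # primary_id -> (synapse_id, category) of the first definition
    errors = []
    for synapse_id, category, primary_ids in files:
        for primary_id in primary_ids:
            if primary_id in first_seen:
                prev_synapse_id, prev_category = first_seen[primary_id]
                errors.append(
                    f"Primary ID {primary_id} has already been defined in file "
                    f"{prev_synapse_id} of type {prev_category}. "
                    f"[Error occurred while processing file: "
                    f"{synapse_id} of type {category}]."
                )
            else:
                first_seen[primary_id] = (synapse_id, category)
    return errors


# Illustrative input: the same primary ID listed in two files.
duplicate_errors = find_duplicate_primary_ids([
    ("syn39282351", "OtherAssay", ["HTA10_0000_06191"]),
    ("syn51703772", "MassSpectrometryLevel1", ["HTA10_0000_06191"]),
])
for message in duplicate_errors:
    print(message)
```

Remembering the first file that defined each ID is what lets the message name both files, so a center can see which two metadata files need to be reconciled.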

ecerami commented 6 months ago

@alexeliotlash I added this suggestion, you can see here: https://hdash.website-us-east-1.linodeobjects.com/HTA10.html

Are we good to close this issue now?

alexeliotlash commented 6 months ago

Looks good. You can close the issue. Thanks.

ecerami commented 6 months ago

Closing!! :-)