Closed hschellman closed 1 year ago
I also have a test suite that can reproduce this but needs some setup. Maybe go over that on Tuesday.
On Jan 1, 2023, at 9:29 AM, Heidi Schellman wrote:
Datasets that include FNAL and some other site seem to go into a "not-found" status, but our code doesn't know how to catch that.
So, two questions:
1. What is "not-found", and why does it happen?
2. This seems to be discovered early on, during creation/worker attachment. How do we catch it and terminate, to avoid wasting wall time on retries?
https://metacat.fnal.gov:9443/dune/dd/gui/P/project?project_id=406
is an example.
The code I'm using to set up the project is from Jake:
createProject in https://github.com/hschellman/DataChallengeWork-loginator/blob/develop/python/submit_dd_jobs.py
Datasets do not go into a "not found" state; files do. If a project includes a file that currently has no replicas in any RSE known to the DD instance, then the file's availability state is "not found". DD re-queries Rucio periodically for updated replica information, and once a replica of the file is found in one of the known RSEs, the file's availability state changes to "found" or "available". So "not found" is not a terminal state. In fact, all files of a new project are initially "not found" until DD discovers the new project, polls Rucio for its file replicas, and marks some (or all, or none) of them as found.
We avoid this by configuring the DD instance to "know" all relevant RSEs, so that the files we want to process are found in, or will soon be delivered to, one of the known RSEs.
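Since "not found" is transient, one way to avoid burning wall time on retries is to poll the availability states for a bounded period and give up only if every file is still "not found" at the deadline. The following is a minimal sketch of that idea, not DD client code: `wait_for_available_files` and the `fetch_states` callable are hypothetical names, and the exact state strings are an assumption. You would need to supply `fetch_states` yourself, wrapping whatever call your DD client uses to list the availability states of a project's files.

```python
import time
from collections import Counter
from typing import Callable, Iterable


def wait_for_available_files(
    fetch_states: Callable[[], Iterable[str]],  # hypothetical: returns one state string per file
    timeout_s: float = 600.0,
    poll_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Counter:
    """Poll until at least one file has left the "not found" state.

    Returns the last observed count of files per availability state.
    Raises TimeoutError if no file has left "not found" by the deadline.
    The label "not found" is an assumption; adjust it to what your client reports.
    """
    deadline = clock() + timeout_s
    while True:
        counts = Counter(fetch_states())
        # Any state other than "not found" means DD has located a replica.
        if any(n for state, n in counts.items() if state != "not found"):
            return counts
        if clock() >= deadline:
            raise TimeoutError(
                f"all {sum(counts.values())} files still 'not found' "
                f"after {timeout_s} s"
            )
        sleep(poll_s)
```

A submission script could call this right after creating the project and exit cleanly on `TimeoutError` instead of letting workers retry. Note that an empty project also times out here, because no file ever becomes available.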