sul-dlss / libsys-airflow

Airflow DAGS for migrating and managing ILS data into FOLIO along with other LibSys workflows
Apache License 2.0

Extend SFTP Provider for retrieving new Vendor Files #261

Closed jermnelson closed 1 year ago

jermnelson commented 1 year ago

Extend SFTP to support downloading MARC and other files from a vendor.

From the Vendor Management App, retrieves connection details using the Organization's interface Okapi endpoint.

Per the Vendor data processing details requirements document, we need to be able to do the following:

  1. Get list of files for given directory
    1. Check to see if the file has already been processed (may require a REST API call to the Vendor Management App)
    2. File may have a date (@ahafele confirm?) to indicate that it should be skipped in favor of other files in that directory
    3. For some vendors, check the existence of corresponding files before retrieving the target file
    4. Select files based on a regex for the filename format. Sometimes there are multiple kinds of files in one directory; these could be pushed into different XComs for processing by download tasks (e.g. some vendor files are PDFs that would not be loaded into FOLIO as bib records but would need to move to a different server mount).
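
The selection steps above (skip already-processed files, then sort the rest by filename pattern) can be sketched as follows. This is only an illustration: `route_files`, the pattern names, and the sample filenames are hypothetical, and the set of processed files is assumed to come from wherever that state ends up being tracked (possibly a REST call to the Vendor Management App).

```python
import re
from typing import Dict, Iterable, List, Set


def route_files(
    filenames: Iterable[str],
    patterns: Dict[str, str],
    processed: Set[str],
) -> Dict[str, List[str]]:
    """Group unprocessed filenames by the first pattern they fully match."""
    compiled = {kind: re.compile(regex) for kind, regex in patterns.items()}
    routed: Dict[str, List[str]] = {kind: [] for kind in patterns}
    for name in filenames:
        if name in processed:
            continue  # already handled on an earlier run
        for kind, regex in compiled.items():
            if regex.fullmatch(name):
                routed[kind].append(name)
                break  # a file goes to at most one bucket
    return routed


# Hypothetical directory listing and patterns, for illustration only.
listing = ["bibs_001.mrc", "bibs_002.mrc", "invoice_002.pdf", "README.txt"]
routed = route_files(
    listing,
    patterns={"marc": r".*\.mrc", "pdf": r".*\.pdf"},
    processed={"bibs_001.mrc"},
)
print(routed)
```

Each bucket could then be pushed to its own XCom key so that separate download tasks pick up MARC files and PDFs independently.
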

ahafele commented 1 year ago

ii. I think we only look at the date in the filename for marcit and we are not sure that is still needed. I can't think of why we would need to do that still but I will confirm.

iii. Sure, right now we only need this for one vendor: GOBI. Order files are built as each order comes in, and we know they are done by the existence of a count file. Current processing looks for the corresponding .cnt file first, then FTPs the .ord file.
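
A minimal sketch of that check, assuming the .cnt and .ord files share the same base name (the thread does not confirm the naming scheme). The function name and sample filenames are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def completed_order_files(filenames: Iterable[str]) -> List[str]:
    """Return .ord files whose matching .cnt file is also present."""
    names = list(filenames)
    # Assumption: "x.cnt" marks "x.ord" as complete (same base name).
    count_stems = {PurePosixPath(n).stem for n in names if n.endswith(".cnt")}
    return [
        n for n in names
        if n.endswith(".ord") and PurePosixPath(n).stem in count_stems
    ]


# Hypothetical listing: the second order file has no count file yet.
print(completed_order_files(["order1.ord", "order1.cnt", "order2.ord"]))
```

An order file without its count file is simply left for a later run.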

justinlittman commented 1 year ago

Questions:

  1. Where is the FTP info for a vendor retrieved from?
  2. What is the given directory?
  3. How do we know if a file has already been processed?
  4. How do we know what the corresponding file is?
  5. Where does regex for filename come from?

justinlittman commented 1 year ago

More questions:

  1. Is there existing code that we can look at?
  2. Where can we get credentials for vendors to try this out?

ahafele commented 1 year ago

More questions:

1. Is there existing code that we can look at?

https://drive.google.com/drive/folders/1hzgitvzgtyeaI-7-W0RNILwFCWySHC57?usp=share_link

2. Where can we get credentials for vendors to try this out?

GOBI is the exemplar vendor right now. Details are here: https://docs.google.com/document/d/11JzLJb9kbW4u3dDLZ9_oFtsRtmnPkl0laS4zigQh5X0/edit#bookmark=id.8gbz5q30ybkw. We are waiting on approval from Darsi to hit the other vendors' FTP. Credentials will be stored in Organizations/interfaces.

@jermnelson could you address the other questions?

justinlittman commented 1 year ago

I'm going to work on a function for getting credentials from FOLIO.

jmartin-sul commented 1 year ago

an example from storytime:

question about "if the airflow app comes across a file, do we know about it, has it already been fetched?" still a bit unclear. should we explicitly track what we've gotten? is it implicit by what's already on a file system? should airflow care about not retrieving things we've seen, or does it get everything that fits the regex and then vendor mgmt app de-dupes... later? or airflow DAG writes to shared storage and just declines to overwrite anything that's already there? always grabs latest and there's always a new file, so not an issue?

answer: data import app tracks what has been processed. but we still have the open question above about coordination between airflow and the data management app, e.g. how to not re-get something that's already been obtained. one possibility is that the airflow retrieval task or tasks see what files are available, then ask the data management app which of them to get (which might be none, if all have already been processed).
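
For comparison, the "implicit by what's already on a file system" option from the question could look roughly like this. It is a sketch of one alternative, not a decision; `not_yet_downloaded` and the filenames are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def not_yet_downloaded(remote_names: Iterable[str], download_dir: Path) -> List[str]:
    """Return remote file names that have no local copy in `download_dir`."""
    return [name for name in remote_names if not (download_dir / name).exists()]


# Demo with a throwaway directory standing in for the shared storage.
with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp)
    (shared / "orders_01.mrc").write_bytes(b"")  # pretend this was fetched earlier
    print(not_yet_downloaded(["orders_01.mrc", "orders_02.mrc"], shared))
```

This keeps Airflow stateless, but it only works if downloaded files stay in place and vendors never reuse a filename for new content.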

jmartin-sul commented 1 year ago

Storytime note: would like more clarity on what the workflow is at a software service level -- i.e. what the data mgmt app tracks, what the Airflow app is aware of, when they talk to one another. Airflow polls data mgmt app on a scheduled basis, gets work to do. But still a bit unclear on what state Airflow itself tracks, and how persistent it is.

jmartin-sul commented 1 year ago

More possibilities for tracking what's been retrieved:

jmartin-sul commented 1 year ago

possibly helpful for this ticket: https://github.com/sul-dlss/FOLIO-Project-Stanford/wiki/Vendor-Management---FOLIO---Airflow-Interaction-Diagram

justinlittman commented 1 year ago

I'm going to start with the simplest possible Task.

Given the inputs:

  1. Get the FTP credentials from the directory.
  2. FTP to the remote server.
  3. Change to the directory.
  4. Download all files that satisfy the pattern. Files will be downloaded to <shared mount>/files/<organization id>/<YYYY-MM-DD>/.
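
Steps 2 to 4 could be sketched roughly as below. This uses Python's standard `ftplib` rather than the Airflow SFTP provider, purely to keep the example self-contained; `download_matching` and its parameter names are hypothetical, and step 1 (looking up credentials) is left out because the thread does not spell out that call.

```python
import re
from datetime import date
from ftplib import FTP
from pathlib import Path
from typing import List


def download_matching(
    ftp: FTP,
    remote_dir: str,
    pattern: str,
    shared_mount: Path,
    organization_id: str,
) -> List[Path]:
    """Download every file in `remote_dir` whose name fully matches `pattern`."""
    # <shared mount>/files/<organization id>/<YYYY-MM-DD>/
    target = shared_mount / "files" / organization_id / date.today().isoformat()
    target.mkdir(parents=True, exist_ok=True)

    ftp.cwd(remote_dir)
    regex = re.compile(pattern)
    downloaded: List[Path] = []
    for name in ftp.nlst():  # names in the current remote directory
        if not regex.fullmatch(name):
            continue
        local_path = target / name
        with local_path.open("wb") as handle:
            ftp.retrbinary(f"RETR {name}", handle.write)
        downloaded.append(local_path)
    return downloaded
```

A caller would open the connection first, for example `with FTP(host) as ftp: ftp.login(user, password)`, and then pass `ftp` in along with the directory, pattern, shared mount path, and organization id.
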

justinlittman commented 1 year ago

This has been supplanted by other tickets.