helxplatform / dug

Semantic Search
MIT License

Updated BDC dbGaP IDs to the latest from BDC Gen3 #343

Closed gaurav closed 7 months ago

gaurav commented 8 months ago

This PR updates the data/bdc_dbgap_ids.csv file with the latest dbGaP identifiers from the BDC Gen3 instance. It also fixes some issues with bin/get_dbgap_data_dicts.py when downloading from FTP:

  1. Previously, we listed the files in a directory over FTP, downloaded those files from the corresponding HTTP server, and then requested the next directory listing over FTP. While the HTTP downloads were running, the idle FTP connection would time out and disconnect. We now explicitly close the FTP connection after each listing and reopen it before requesting the next one.
  2. If a download fails, we now try to delete the local directory for that variable, since it will be either empty or incomplete. Re-running the script then downloads any variables that have not already been downloaded.
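
The two fixes above can be sketched roughly as follows. This is a minimal illustration and not the actual code in bin/get_dbgap_data_dicts.py: the function names `list_remote_files` and `download_into`, the `ftp_factory` and `fetch` parameters, and the anonymous login are all assumptions made for the example.

```python
import shutil
from ftplib import FTP
from pathlib import Path
from typing import Callable, Iterable, List


def list_remote_files(
    host: str,
    remote_dir: str,
    ftp_factory: Callable[[str], FTP] = FTP,  # hypothetical hook, eases testing
) -> List[str]:
    """Open a fresh FTP connection, list one directory, then close it.

    Connecting per listing means a long HTTP download between two
    listings cannot leave us holding a timed-out connection.
    """
    ftp = ftp_factory(host)
    try:
        ftp.login()  # assumption: the server allows anonymous login
        return ftp.nlst(remote_dir)
    finally:
        ftp.close()  # always release the connection, even on error


def download_into(
    files: Iterable[str],
    local_dir: Path,
    fetch: Callable[[str, Path], None],  # hypothetical downloader (e.g. over HTTP)
) -> bool:
    """Download each file into local_dir; on failure, delete local_dir.

    Removing the empty or partial directory means a later run sees it
    as "not downloaded yet" and retries it from scratch.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    try:
        for name in files:
            fetch(name, local_dir / Path(name).name)
    except OSError:
        # Best-effort cleanup: ignore errors while deleting.
        shutil.rmtree(local_dir, ignore_errors=True)
        return False
    return True
```

Passing the connection factory and the downloader in as parameters is only a convenience for exercising the logic without a network; the real script may structure this differently.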