gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.

Refactor ingest/extract process #122

Open dolsysmith opened 3 years ago

dolsysmith commented 3 years ago

See also #117 and #121.

Problem

Depending on the parameters, user-generated extracts from large datasets can take a long time (hours or days) and consume a lot of disk space (~1TB). We have implemented a check to prevent the creation of multiple versions of a full (non-parametrized) extract, but we have nothing in place to prevent users from creating multiple versions of the same parametrized extract.
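
For what it's worth, one way to extend that check to parametrized extracts would be to fingerprint the requested parameters and look for an existing extract with the same fingerprint before launching a new job. A minimal sketch (the function and field names are hypothetical, not the app's current code):

```python
import hashlib
import json


def extract_fingerprint(dataset_id, params):
    """Return a stable hash for an extract request.

    `params` is assumed to be the dict of user-selected extract options;
    serializing with sorted keys makes the hash independent of the order
    in which the options were supplied.
    """
    canonical = json.dumps({"dataset": dataset_id, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# A repeated request with identical parameters produces the same fingerprint,
# so the app could point the user at the existing extract instead of starting
# a duplicate job.
```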

In addition, the extract jobs do not fail gracefully when interrupted by a critical error (such as a lack of disk space).
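
A related idea (purely illustrative, not what the code does now): check free space on the shared volume before starting a job and abort cleanly, rather than letting the write fail partway through.

```python
import shutil

# Assumed threshold; a real value would depend on the dataset and extract type.
MIN_FREE_BYTES = 50 * 1024**3  # ~50 GB of headroom


def has_enough_space(path="/storage/dataset_loading", required=MIN_FREE_BYTES):
    """Return True if the volume holding `path` has at least `required` bytes free."""
    return shutil.disk_usage(path).free >= required


# Before launching an extract job (mark_job_failed is a hypothetical helper):
# if not has_enough_space():
#     mark_job_failed("Insufficient disk space on the extract volume.")
```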

Questions

  1. What use cases do these custom extracts satisfy?
  2. Would we still meet most use cases if the app placed a reasonable limit on the size of custom extracts, given that full extracts are also available?

Proposal

Full extracts

Benefits

Workflow diagram

Further documentation (including testing) in this notebook.

dolsysmith commented 3 years ago

Dev Setup Instructions

Updated 7/28/21

For development on this ticket, you will need to set up an NFS share in your dev environment, similar to the shared /storage/dataset_loading directory on prod. A shared filesystem is required for persisting data to disk from a Spark cluster.
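
To illustrate why the mount has to be shared (a sketch only, with placeholder paths and column names, not the actual loader code): when Spark writes a DataFrame, each executor writes its own partition files directly to the output path, so that path must resolve to the same storage on the driver and on every worker.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweetsets-nfs-test").getOrCreate()

# Placeholder input path; any small set of tweet JSON files will do.
df = spark.read.json("/storage/dataset_loading/example_dataset/*.json.gz")

# Each executor writes its partitions under this directory, so it must be the
# same NFS-backed path on every node in the cluster.
df.select("id_str", "full_text").write.mode("overwrite").csv(
    "/storage/dataset_loading/example_dataset/extract_test", header=True
)
```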

To set up an NFS between two dev VM's, I used the following method (adapted from these instructions).

  1. Install the NFS server package on one VM (henceforth the NFS server):
      sudo apt update
      sudo apt install nfs-kernel-server
  2. Configure a directory to be exported. I'm using a directory on the /storage volume, which has adequate space for TweetSets testing.
      a. Create a directory to share: sudo mkdir /storage/dataset_loading
      b. Set yourself as the directory's owner: sudo chown dsmith:dsmith /storage/dataset_loading
      c. Edit the /etc/exports file: sudo nano /etc/exports
      d. Add a line at the bottom mapping a directory on the server to the client VM, using the client VM's IP address. (This should be the IP address internal to the WRLC network, not the external IP address. On my VM's, it's associated with the ens160 interface.) Use the values for your_uid and your_gid from step c. (This will give the client the same privileges to the shared folder as those associated with your user account on the server.)
    
    /storage/dataset_loading         172.27.20.233(rw,sync,no_root_squash,no_subtree_check)
  3. Restart the NFS server: sudo systemctl restart nfs-kernel-server. (If you get an error from this command, you can view the service log by running sudo journalctl -u nfs-server.service).
  4. On the client VM, install the nfs-common package:
    
    sudo apt update
    sudo apt install nfs-common
  5. Create a mount point on the client (if it doesn't already exist) and mount the NFS directory from the server at that mount point. In this case, I'm mounting the `dataset_loading` directory on the server VM to the same path on the client. (Note that mounting will hide any existing data in the local directory while the mount is active.) Note that the IP address here is that of the **server** VM.

      sudo mkdir /storage/dataset_loading
      sudo mount 172.27.20.231:/storage/dataset_loading /storage/dataset_loading

  6. To persist this change on reboot, add the following line to the `/etc/fstab` file on the client VM:

      172.27.20.231:/storage/dataset_loading /storage/dataset_loading nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0

dolsysmith commented 3 years ago

Other considerations:

Spark DataFrame API

Spark/pyspark version

dolsysmith commented 3 years ago

Regarding our use of the Elasticsearch Scroll API to retrieve the results for the user-generated extracts, please note that in more recent versions of Elasticsearch, the use of this API for deep pagination is discouraged:

> We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).
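
For reference, here is roughly what PIT + search_after pagination looks like with the low-level Python client (a sketch only; it assumes elasticsearch-py 7.10+, an index named tweets, and a placeholder query and page size, not the app's current code):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Open a point in time so the index state is preserved while we page through it.
pit = es.open_point_in_time(index="tweets", keep_alive="2m")
pit_id = pit["id"]
search_after = None

try:
    while True:
        body = {
            "size": 1000,
            "query": {"match_all": {}},
            # When a PIT is supplied, no index is passed to the search call.
            "pit": {"id": pit_id, "keep_alive": "2m"},
            # _shard_doc is the cheapest tiebreaker sort for PIT pagination.
            "sort": [{"_shard_doc": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        page = es.search(body=body)
        hits = page["hits"]["hits"]
        if not hits:
            break
        # ... write hits to the extract ...
        search_after = hits[-1]["sort"]
        pit_id = page.get("pit_id", pit_id)  # the PIT id can change between pages
finally:
    es.close_point_in_time(body={"id": pit_id})
```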