Open dolsysmith opened 3 years ago
Updated 7/28/21
For development on this ticket, it will be necessary to set up an NFS on your dev environment, similar to the shared /storage/dataset_loading
directory on prod. A shared filesystem is necessary for persisting data to disk from a Spark cluster.
To set up an NFS between two dev VM's, I used the following method (adapted from these instructions).
sudo apt update
sudo apt install nfs-kernel-server
/storage
volume, which has adequate space for TweetSets testing.
a. Create a directory to share: sudo mkdir /storage/dataset_loading
b. Set yourself as the directory's owner: sudo chown dsmith:dsmith /storage/dataset_loading
c. Edit the /etc/exports
file: sudo nano /etc/exports
d.. Add a line at the bottom mapping a directory on the server to the client VM, using the client VM's IP address. (This should be the IP address internal to the WRLC network, not the exterrnal IP address. On my VM's, it's associated with the ens160
interface.) Use values for your_uid
and your_gid
from step c. (This will give the client the same privileges to the shared folder as those associated with your user account on the server.)
/storage/dataset_loading 172.27.20.233(rw,sync,no_root_squash,no_subtree_check)
sudo systemctl restart nfs-kernel-server
. (If you get an error from this command, you can view the service log by running sudo journalctl -u nfs-server.service
).nfs-common
package:
sudo apt update
sudo apt install nfs-common
5. Create a mount point on the client (if it doesn't already exist) and mount the NFS directory (from the server) to that directory. In this case, I'm mounting the `dataset_loading` directory on the server VM to the same directory on the client. (I believe this operation will overwrite any data in the local directory if it exists.) Note that the IP address here is that of the **server** VM.
sudo mkdir /storage/dataset_loading sudo mount 172.27.20.231:/storage/dataset_loading /storage/dataset_loading
6. To persist this change on reboot, add the following line to the `/etc/fstab` file on the client VM:
172.27.20.231:/storage/dataset_loading /storage/dataset_loading nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
Other considerations:
arrays_zip
Spark function used in creating the mentions extracts requires version 2.4. TweetSets currently uses version 2.3.2; I believe we could upgrade without any breaking changes, but we would need to test.es-hadoop
library as well.Regarding our use of the Elasticsearch Scroll API to retrieve the results for the user-generated extracts, please note that in more recent version of Elasticsearch, the use of this API for deep pagination is discouraged:
We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).
See also #117 and #121.
Problem
Depending on the parameters, user-generated extracts from large datasets can take a long time (hours or days) and consume a lot of disk space (~1TB). We have implemented a check to prevent the creation of multiple versions of a full (non-parametrized) extract, but we have nothing in place to prevent users from creating multiple versions of the same parametrized extract.
In addition, the extract jobs do not fail gracefully when interrupted by a critical error (such as a lack of disk space).
Questions
Proposal
Full extracts
mentions
datasets (currently disabled).Custom extracts
Benefits
Workflow diagram
Further documentation (including testing) in this notebook.