sr320 / course-fish546-2021


Getting data #11

Closed sr320 closed 3 years ago

sr320 commented 3 years ago

How would you download a large data set at the command line and ensure that the integrity of the data is maintained (i.e., the file you downloaded is exactly the same as the one on the server)?

aspencoyle commented 3 years ago

If you have access to the server:
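
A minimal sketch of that case, assuming rsync is available (the user, host, and path here are made up):

# -a preserves timestamps and permissions, -v is verbose, -z compresses in transit
rsync -avz user@remote.server.edu:/data/project/reads.fastq.gz ./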

If data is only accessible via URL:
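
For example (placeholder URL; wget saves the file under its remote name, and curl -O does the same):

wget https://example.com/data/reads.fastq.gz
curl -O https://example.com/data/reads.fastq.gz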

jdduprey commented 3 years ago

For synchronizing entire directories you could use:
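
e.g., something along these lines (host and paths are hypothetical; the trailing slash on the source means "the contents of this directory"):

# re-running the same command later only transfers files that changed
rsync -av user@remote.server.edu:/data/project/ ./project/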

For checking data integrity:
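
For instance, with shasum (filenames are made up):

# on the server (or from the data provider): record the digest
shasum reads.fastq.gz > checksums.sha
# locally, after the download: verify it
shasum -c checksums.sha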

skreling commented 3 years ago

If the dataset is available via URL (similar to the BLAST in Jupyter walkthrough):

curl <url> > path/to/output_file

(The redirect target is a file path, not a directory, but the path can include the directory you want the file to go in.)

To check integrity (IDK if this is correct):
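
A minimal version of that check might be computing a digest locally and comparing it by eye to one published on the server:

shasum myFile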

Alternatively, you can download the file with wget and compute its MD5 digest in the same pipeline (-O - writes the download to stdout, tee saves a copy to myFile, and md5sum hashes the stream):

wget -O - URL | tee myFile | md5sum > MD5SUM

Also, as Joe already said, you can use shasum to print or check SHA checksums.

dippelmax commented 3 years ago

There are a few options when downloading data. You can use wget, which is good for downloading over HTTP and FTP. It can download recursively, so you should rein it in with flags like --no-parent and --limit-rate. curl is also used to download data, and it supports additional protocols such as SFTP and SCP. For larger and slower downloads it is good to use rsync, which is better at synchronizing entire directories.
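
A sketch of the recursive wget case (URL and rate are made up):

# --no-parent keeps wget from ascending above the starting directory;
# --limit-rate caps bandwidth so the server isn't overwhelmed
wget --recursive --no-parent --limit-rate=500k https://example.com/data/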

To ensure the integrity of a file, you can perform checksums. These are compact summaries of the data that will show if any of the data has changed. shasum and md5sum are the two checksum programs discussed in the text. You can also perform a difference check using diff, which works line by line and notifies you of lines that differ between the two files.
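
For example (filenames are hypothetical):

# compare two copies line by line; no output means no differences
diff original.fasta downloaded.fasta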

meganewing commented 3 years ago

Download using wget (HTTP or FTP; its benefit over curl is the --recursive flag), curl (HTTP, FTP, SFTP, or SCP; its benefits over wget are support for more transfer protocols and the ability to follow page redirects), or rsync (slower, but more heavy-duty than curl or wget; best for big data, and it can compress during transfers).
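
For the redirect case, a quick sketch (placeholder URL):

# -L follows redirects; -O keeps the remote filename
curl -L -O https://example.com/get/dataset.tar.gz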

To check data integrity: if you used rsync, you can run it again to confirm everything is synced properly and that nothing has changed between the downloaded data and the source data. Check the exit status to make sure no errors occurred during transfer. Use checksums such as SHA-1 (shasum) or MD5 (md5sum). If a difference is found, you can pinpoint it using diff.
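
The exit-status check might look like this (paths are hypothetical):

rsync -av user@remote.server.edu:/data/ ./data/
# $? holds the last command's exit status: 0 on success, non-zero on error
echo $?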

Brybrio commented 3 years ago

From the command line, data can be downloaded in two ways: directly from the web using wget (which can download related files) or curl (which can follow file redirections), or from a server using rsync, which is better suited for larger files and helps with synchronizing changes in files and directories. To make sure my downloaded file matches the original one, I would use the shasum or md5 commands. These report hexadecimal digests, like barcodes unique to each file, which can be compared against the source's; diff can then point out the differences between the files.
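
One way to do that comparison, assuming the server publishes its digests in a file like server.md5 (names are made up):

# save the local digests, then diff against the server's list; no output means a match
md5sum downloaded.fastq > local.md5
diff server.md5 local.md5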

laurel-nave-powers commented 3 years ago

At the command line there are a couple of different ways to download a large data set. You can use wget or curl, which download data directly from the internet. The difference between them is that wget can download related files recursively, while curl can follow file redirections. You can also use rsync, which is good for very large data sets. To make sure nothing happened to the data in the download process, you can use a checksum command like shasum, and from there you can find the specific differences using diff.
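
For example, combining the two steps (filenames are made up; reading via stdin keeps the filename out of shasum's output so only the digests are compared):

diff <(shasum < local_copy.fastq) <(shasum < reference_copy.fastq)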