sr320 / course-fish546-2021


Getting data #11

Closed sr320 closed 3 years ago

sr320 commented 3 years ago

How would you download a large data set at the command line and ensure that the integrity of the data is maintained (i.e., the file you downloaded is exactly the same as the one on the server)?

aspencoyle commented 3 years ago

If you have access to the server:
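
A minimal sketch of that case, assuming rsync is available (the user, host, and path here are made up):

# -a preserves timestamps and permissions, -v is verbose, -z compresses in transit
rsync -avz user@remote.server.edu:/data/project/reads.fastq.gz ./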

If data is only accessible via URL:
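
For example (placeholder URL; wget saves the file under its remote name, and curl -O does the same):

wget https://example.com/data/reads.fastq.gz
curl -O https://example.com/data/reads.fastq.gz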

jdduprey commented 3 years ago

For synchronizing entire directories you could use:
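
e.g., something along these lines (host and paths are hypothetical; the trailing slash on the source means "the contents of this directory"):

# re-running the same command later only transfers files that changed
rsync -av user@remote.server.edu:/data/project/ ./project/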

For checking data integrity:
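
For instance, with shasum (filenames are made up):

# on the server (or from the data provider): record the digest
shasum reads.fastq.gz > checksums.sha
# locally, after the download: verify it
shasum -c checksums.sha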

skreling commented 3 years ago

If the dataset is available via URL (similar to the BLAST in Jupyter walkthrough):

curl <url> > path/to/output_file

(The redirect target is a file path, not a directory, but the path can include the directory you want the file to go in.)

To check integrity (IDK if this is correct):
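
A minimal version of that check might be computing a digest locally and comparing it by eye to one published on the server:

shasum myFile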

Alternatively, you can download the file with wget and compute its MD5 digest in the same pipeline (-O - writes the download to stdout, tee saves a copy to myFile, and md5sum hashes the stream):

wget -O - URL | tee myFile | md5sum > MD5SUM

Also, as Joe already said, you can use shasum to print or check SHA checksums.

dippelmax commented 3 years ago

There are a few options when downloading data. You can use wget, which is good for downloading over HTTP and FTP. It can download recursively, so you should rein it in with flags like --no-parent and --limit-rate. curl is also used to download data, and it supports additional protocols such as SFTP and SCP. For larger and slower downloads it is good to use rsync, which is better at synchronizing entire directories.
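
A sketch of the recursive wget case (URL and rate are made up):

# --no-parent keeps wget from ascending above the starting directory;
# --limit-rate caps bandwidth so the server isn't overwhelmed
wget --recursive --no-parent --limit-rate=500k https://example.com/data/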

To ensure the integrity of a file, you can perform checksums. These are compact summaries of the data that will show if any of the data has changed. shasum and md5sum are the two checksum programs discussed in the text. You can also perform a difference check using diff, which works line by line and notifies you of lines that differ between the two files.
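
For example (filenames are hypothetical):

# compare two copies line by line; no output means no differences
diff original.fasta downloaded.fasta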

meganewing commented 3 years ago

Download using wget (HTTP or FTP; its benefit over curl is the --recursive flag), curl (HTTP, FTP, SFTP, or SCP; its benefits over wget are support for more transfer protocols and the ability to follow page redirects), or rsync (slower, but more heavy-duty than curl or wget; best for big data, and it can compress during transfers).
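
For the redirect case, a quick sketch (placeholder URL):

# -L follows redirects; -O keeps the remote filename
curl -L -O https://example.com/get/dataset.tar.gz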

To check data integrity: if you used rsync, you can run it again to confirm everything is synced properly and that nothing has changed between the downloaded data and the source data. Check the exit status to make sure no errors occurred during transfer. Use checksums such as SHA-1 (shasum) or MD5 (md5sum). If a difference is found, you can pinpoint it using diff.
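
The exit-status check might look like this (paths are hypothetical):

rsync -av user@remote.server.edu:/data/ ./data/
# $? holds the last command's exit status: 0 on success, non-zero on error
echo $?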

Brybrio commented 3 years ago

From the command line, data can be downloaded in two ways: directly from the web using wget (which can download related files) or curl (which can follow file redirections), or from a server using rsync, which is better suited for larger files and helps with synchronizing changes in files and directories. To make sure my downloaded file matches the original one, I would use the shasum or md5 commands. These report hexadecimal digests, like barcodes unique to each file, which can be compared against the source's; diff can then point out the differences between the files.
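
One way to do that comparison, assuming the server publishes its digests in a file like server.md5 (names are made up):

# save the local digests, then diff against the server's list; no output means a match
md5sum downloaded.fastq > local.md5
diff server.md5 local.md5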

laurel-nave-powers commented 3 years ago

At the command line there are a couple of different ways to download a large data set. You can use wget or curl, which download data directly from the internet. The difference between them is that wget can download related files recursively, while curl can follow file redirections. You can also use rsync, which is good for very large data sets. To make sure nothing happened to the data in the download process, you can use a checksum command like shasum, and from there you can find the specific differences using diff.
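
For example, combining the two steps (filenames are made up; reading via stdin keeps the filename out of shasum's output so only the digests are compared):

diff <(shasum < local_copy.fastq) <(shasum < reference_copy.fastq)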