Closed sr320 closed 3 years ago
If you have access to the server:
rsync -av <source> <destination> : download file
rsync -av <source> <destination> : repeat to check if file is identical
echo $? : check exit status to see if problems occurred with rsync
shasum -c checksums_file.sha
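A runnable sketch of the exit-status check above; `cp` stands in for rsync so it works without a remote server, and all file names are invented for illustration:

```shell
# Hypothetical sketch: use the exit status to confirm a transfer succeeded.
# 'cp' stands in for 'rsync -av <source> <destination>' so this runs locally.
mkdir -p /tmp/xfer_demo && cd /tmp/xfer_demo
echo "some sequencing data" > source.txt

cp source.txt destination.txt
status=$?    # capture the exit status right away ($? is reset by the next command)
if [ "$status" -eq 0 ]; then
    echo "transfer OK"
else
    echo "transfer failed with status $status"
fi
```

A non-zero status from rsync would indicate a problem during the transfer, which is why checking it immediately matters.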
If data is only accessible via URL:
wget <specify which data you want to download> http://datasite/data_location.html
or curl http://datasite/data_location/datafile01.txt > datafile1.txt
shasum -c checksums_file.sha
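A minimal local sketch of the checksum verification step; sha1sum (the coreutils counterpart of shasum) is used so it runs on a stock Linux box, and the file names are invented:

```shell
# Hypothetical sketch: create a checksums file and verify a file against it.
mkdir -p /tmp/checksum_demo && cd /tmp/checksum_demo
echo "downloaded data" > datafile1.txt

# Record the checksum, as a data provider would publish it.
sha1sum datafile1.txt > checksums_file.sha

# Verify the downloaded file against the recorded checksum.
sha1sum -c checksums_file.sha    # prints "datafile1.txt: OK"
```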
For synchronizing entire directories you could use:
rsync -avz -e ssh <file source> <file destination>
wget <url file location>
curl <url file location>
For checking data integrity:
shasum <some file>
will return the checksum, which can be compared to the checksum provided by the data source
shasum data/*fastq > fastq_checksums.sha
would create a single checksums file containing all checksums of .fastq files
diff -u <file1> <file2>
outputs a unified diff format which summarizes the differences between files

If dataset is available via URL (similar to the BLAST in Jupyter walkthrough):
curl <url> \
> <file you want the output to go to>
gunzip command

To check integrity (IDK if this is correct):
head command to look at the first portion and see if that very basically checks out

Alternatively you can download the file using wget and run md5sum myFile to check the integrity:
wget -O - URL | tee myFile | md5sum > MD5SUM
Also, as Joe said already, you can use shasum to print or check SHA checksums.
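The wget | tee | md5sum one-liner can be tried offline; here cat stands in for `wget -O - URL`, and the file names are invented:

```shell
# Hypothetical sketch: write a download to disk and checksum it in a single pass.
mkdir -p /tmp/tee_demo && cd /tmp/tee_demo
echo "pretend this arrived over the network" > remote_data.txt

# 'cat remote_data.txt' stands in for 'wget -O - URL'.
cat remote_data.txt | tee myFile | md5sum > MD5SUM

# myFile holds the data; MD5SUM holds its checksum.
# Recomputing the checksum on myFile should give the same hash.
md5sum myFile
```

tee splits the stream, so the file is saved and checksummed without reading the data twice, which matters for large downloads.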
There are a few options when downloading data. You can use wget for downloading data; it is good for downloading over HTTP and FTP. It can download recursively, so you should limit it using --no-parent and --limit. curl is also used to download data, especially over SFTP and SCP. For larger and slower downloads it is good to use rsync, which is better at synchronizing entire directories.
To ensure the integrity of a file, you can perform checksums. These are summaries of the data which will show if any of the data has changed. shasum and md5 are the two checksum algorithms discussed in the text. Also, you can perform a difference check using diff, which works line by line and notifies you of lines which differ between the two files.
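As a tiny illustration of the line-by-line behaviour of diff (the file contents are invented):

```shell
# Hypothetical sketch: diff reports only the lines that differ between two files.
mkdir -p /tmp/diff_demo && cd /tmp/diff_demo
printf 'line one\nline two\nline three\n' > copy_a.txt
printf 'line one\nline CHANGED\nline three\n' > copy_b.txt

# Exit status 0 means identical, 1 means differences were found.
diff -u copy_a.txt copy_b.txt || echo "files differ"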
Download using wget
(http or FTP; benefit of this over curl is the --recursive command), curl
(http, FTP, SFTP or SCP; benefit over wget is more transfer protocols work and you can follow page redirects), or rsync
(slower, but more heavy duty than curl or wget. Best for big data and can compress during transfers).
To check data integrity: if you ran rsync you can run it again to check everything is sync'd properly and that nothing has changed between the downloaded data and the source data. Check the exit status to make sure no errors occurred during transfer. Utilize check-sums such as SHA-1 (shasum
) or MD5 (md5sum
). If there is a difference found, you can pinpoint it using diff
From the command line, data can be downloaded in two ways: directly from the web using the commands wget
(downloads related files) or curl
(follows file redirections) and from a server using rsync
, which is better suited for larger files and helps with synchronizing changes in files and directories.
To make sure my downloaded file matches the original one, I would use the shasum
or md5
commands. These functions report hexadecimal numbers like barcodes unique to any files, which can then be compared to the source ones using diff
to point out differences between these files.
At the command line there are a couple different ways to download a large data set. You can use wget
or curl
which is used for data directly from the internet. The difference between wget
and curl
is that wget
downloads related files and curl
can follow file redirections. You can also use rsync
which is good for very large data sets.
To make sure nothing happened to the data in the download process you can use checksums like the command shasum
. From there you can find the specific differences using diff
.
How would you download a large data set at the command line and ensure the integrity of the data was maintained (i.e. the file you downloaded is the exact same as on the server)?