For downloading big data, I could use two command-line programs, `wget` or `curl`. For quick downloads of HTTP or FTP files I would use `wget`; if I wanted to download files using secure protocols (SFTP or SCP), I would use `curl`. `curl` is also the better choice when it is to my advantage to have the downloaded file written to standard output, i.e. printed to the terminal, so it can be redirected or piped into another program. There are other advantages to using `curl`: it can follow page redirects, and it is also available as a library (libcurl).
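A minimal sketch of the "write to standard output" behavior: the `file://` URL below is a stand-in for a real HTTP or FTP address, so the example runs without a network connection (`wget` usage would be analogous, e.g. `wget http://example.com/data.fa`, which saves to the current directory instead).

```shell
# Create a small local file to act as the "remote" data.
printf 'sample data\n' > source.txt

# curl writes to standard output by default, so redirect it to a file:
curl -s "file://$PWD/source.txt" > copy_stdout.txt

# Or let curl name the output file itself with -o:
curl -s -o copy_named.txt "file://$PWD/source.txt"

# Both copies should be byte-identical to the source:
cmp source.txt copy_stdout.txt && echo "copies match"
```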
To check data integrity, I would use either of two checksum algorithms, SHA-1 or MD5, to check whether the downloaded files differ from the originals. To find the specific differences among files I would use the Unix tool `diff`.
To download large data, I would use `wget` or `curl`, depending on how the files are served. If they are available over HTTP or FTP, I'd use `wget`. If they require other protocols, I'd use `curl`.
If my data download finished without any error notifications, I would still check data integrity using a checksum such as SHA-1. It compares the downloaded file to the file you sourced it from: if there is any difference between the two files, their SHA-1 outputs will be completely different. You want the outputs to be identical, so you know you downloaded the data in its entirety with no mistakes. Mistakes can happen when downloading large data sets because the transfers take a long time, which increases the risk of dropped network connections during the process, which could result in data loss.
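A minimal sketch of that comparison, assuming `shasum` (or its coreutils equivalent `sha1sum`) is installed; the two local files stand in for the server's copy and the downloaded copy:

```shell
# Use whichever SHA-1 tool is available on this system.
SHA=$(command -v shasum || command -v sha1sum)

# Stand-ins for the source file and the downloaded file.
printf 'ACGTACGT\n' > source.fa
cp source.fa downloaded.fa

# The first field of the output is the hexadecimal digest.
sum_src=$("$SHA" source.fa | awk '{print $1}')
sum_dl=$("$SHA" downloaded.fa | awk '{print $1}')

if [ "$sum_src" = "$sum_dl" ]; then
  echo "checksums match: download intact"
else
  echo "checksums differ: possible corruption"
fi
```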
Download data: For HTTP or FTP links, I could use `wget` or `curl`. `curl` can also be used to securely download files using the SFTP and SCP protocols.
Ensure integrity was maintained: Use checksums (either `shasum` or `md5sum`). If the checksum of the downloaded file differs from the published one, the file could have been corrupted during the download.
Download data
- `wget` or `curl` to download from HTTP or FTP (`curl` can also download files using SFTP and SCP)

Check integrity of data
- `grep`/`zgrep "^>"` to check that everything downloaded
- `diff`/`zdiff` to find the differences

To download data, use `curl` or `wget` depending on the file types. To ensure integrity, use a checksum and compare the downloaded data's digest with the source data's digest. If they don't match, use `diff` to find what the difference is.
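The `grep "^>"` check above counts FASTA records, since every record starts with a `>` header line (`zgrep` does the same for gzipped files). A small sketch with a made-up two-record file:

```shell
# A tiny FASTA file standing in for a downloaded sequence dataset.
cat > seqs.fa <<'EOF'
>seq1
ACGT
>seq2
TTGA
EOF

# Count header lines; compare against the count the data provider reports.
n=$(grep -c '^>' seqs.fa)
echo "$n records"
```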
You can download data from the command line in one of two ways: the first is to use `wget`, and the second is `curl`. `wget` downloads your data from the command line and puts it in your current directory; it is useful for HTTP and FTP files. `curl` is similar, but it writes the file to standard output.
The integrity of the data is checked using checksums. These are small, fixed-length digests of the data that will tell you if anything in the file has changed. The two most common checksum algorithms are SHA and MD5; SHA-1 is a newer algorithm than MD5.
I would download data from HTTP and FTP servers using either `wget` or `curl`, and over SFTP and SCP using only `curl`. `wget` downloads data to the current directory, while `curl` writes data to standard output, so you usually redirect it.
To check data integrity I would use a checksum algorithm: either SHA-1 via the program `shasum` or MD5 via the program `md5sum`. This will tell me if my downloaded file differs from the original data. If they don't match, the `diff` command will show me the lines that differ between the files (plus however many lines of context above and below the difference that I ask it to show).
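A small sketch of that `diff` usage on two made-up files; `-U 1` asks for one line of context above and below each difference:

```shell
# Stand-ins for the original and the (corrupted) downloaded file.
printf 'line1\nline2\nline3\n' > original.txt
printf 'line1\nlineX\nline3\n' > downloaded.txt

# diff exits nonzero when the files differ, so capture instead of failing.
out=$(diff -U 1 original.txt downloaded.txt) || true
echo "$out"
```

Lines prefixed with `-` come from the first file, `+` from the second, and unprefixed lines are shared context.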
For downloading data I would use either `wget` or `curl`. To check the integrity of the data I would use `shasum` or `md5sum`. Since this only tells the user whether the files differ, I would then use `diff` to figure out how the files differ.
`wget` and `curl` are two common methods for downloading data via the command line from HTTP and FTP addresses. Both work well for moderately sized files over non-secure connections. `rsync` and `scp` are also options for secure downloads of larger files. `rsync` is well-equipped for large data transfers, and can be used to check that all of the contents of a directory have been transferred.
`shasum` and `md5sum` can be used to check for differences between files. This can also be automated using `shasum`'s check option (`-c`), which reports which files are different, so each hexadecimal code produced doesn't need to be visually inspected. To determine how files differ, and not just whether they differ, `diff` can be used to give the locations of differences between files.
If dealing with relatively small files, the best commands for downloading data are `curl` and `wget`. The utility of one command over the other depends on your security requirements and the file structure you are trying to download from.
`wget` has recursive behavior that allows it to follow links and retrieve files that are not directly linked in the URL passed to it. In the case of many subdirectories that each contain one file you need to download, the recursive download action of `wget` can easily pull out just those files.
`curl` allows you to simultaneously download files and write them to standard output. It also has the ability to download files via more secure protocols than `wget`, like SFTP and SCP.
Larger files, such as WGS datasets, are better downloaded via `rsync`, which only transfers the difference between a source file and the specified destination. This ensures faster download speeds, and a second `rsync` can be used as a quick checkpoint to make sure that a download proceeded as expected.
After a large download, I would use the checksum commands `shasum` or `md5sum` to compare the hexadecimal checksums of my downloaded files to their published values. Any difference between these sums indicates a download error that I can investigate using `diff`.
We can use both `wget` and `curl` to download large datasets. These work for FTP and HTTP, although `curl` can also be used with the SFTP and SCP protocols. `wget` is better suited for quick downloads and recursive downloads.
We can check the integrity of the downloaded file by using the SHA (`shasum`) and MD5 (`md5sum`) checksum programs, which let us compare the hexadecimal digests of the original and downloaded files to see if they differ. To investigate differences, we can use `diff` to find what the difference is.
`wget` lets you download data from HTTP and FTP servers: you provide it a link and `wget` will download from wherever you point it. `wget` can also download data recursively, which means it will follow and download the pages linked to, and then follow and download the links on those pages. But be careful with this recursive feature, since it might overload the remote server. `curl` can transfer files using more protocols than `wget`, including SFTP (secure FTP) and SCP (secure copy). By default, `curl` writes the file to standard output. `rsync` is good for heavy-duty tasks: it only sends the differences between file versions (when a copy already exists or partially exists), and it is an excellent choice for network backups of entire directories.
`shasum` and `md5sum` can verify integrity, and `diff` can be used to see how files differ. I would probably use `curl`, then a checksum command to confirm the file's integrity.
How would you download a large data set at the command line and ensure the integrity of the data was maintained (i.e. the file you downloaded is the exact same as on the server)?