sr320 / course-fish546-2018


Getting Data #13

Closed sr320 closed 5 years ago

sr320 commented 5 years ago

How would you download a large data set at the command line and ensure the integrity of the data was maintained (i.e. the file you downloaded is the exact same as on the server)?

magobu commented 5 years ago

For downloading big data, I could use two command-line programs, wget or curl. For quick downloads of HTTP or FTP files I would use wget; if I wanted to download files using secure protocols (SFTP or SCP), I would use curl. Also, if it were to my advantage to have the downloaded file written to standard output (not quite sure what this means?), I would use curl. There are other advantages to using curl: it can follow page redirects, and it is also available as a library.
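
A minimal sketch of both tools on a made-up URL (the host, filenames, and whether this curl build supports SFTP are assumptions):

```bash
# Quick HTTP/FTP download with wget; the file lands in the current directory
wget https://example.com/data/reads.fastq.gz

# curl writes to standard output by default, so redirect it or use -O
curl https://example.com/data/reads.fastq.gz > reads.fastq.gz
curl -O https://example.com/data/reads.fastq.gz   # keep the remote filename

# Secure download over SFTP (requires a curl build with SFTP support);
# -u with only a username makes curl prompt for the password
curl -u username -O sftp://example.com/data/reads.fastq.gz
```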

To check data integrity, I would use either of the two checksum algorithms, SHA-1 or MD5, to check whether the downloaded files differ from the originals. To find the specific differences among files I would use the Unix tool diff.
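
A short sketch of that workflow, with placeholder filenames; the idea is to compare the locally computed digest to the one published on the server:

```bash
# SHA-1 digest of the downloaded file (shasum defaults to SHA-1)
shasum reads.fastq.gz

# MD5 digest instead, if the server publishes MD5 values
md5sum reads.fastq.gz

# If the digests disagree and the file is plain text, diff shows
# which lines differ between two copies
diff local_copy.fasta server_copy.fasta
```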

grace-ac commented 5 years ago

To download large data, I would use wget or curl, depending on how the files are served. If they are available over HTTP or FTP, I'd use wget. For other protocols, I'd use curl.

If my data download finished without any error notifications, I would check data integrity using a checksum such as SHA-1. It compares the downloaded file to the file you sourced it from: if there is any difference between the two files, the SHA-1 output will be completely different for each. You want the outputs to match, so you know you downloaded the data in its entirety with no mistakes. Mistakes can happen when downloading large data sets because they take a long time, which increases the risk of dropped network connections during the process, which could result in data loss.

yaaminiv commented 5 years ago

Download data: For HTTP or FTP links, I could use wget or curl. curl can also be used to securely download files using SFTP and SCP protocols.

Ensure integrity was maintained: Use checksums (either shasum or md5sum). If the checksum for the downloaded file is different than it should be, the file could have been corrupted during the download.

kimh11 commented 5 years ago

Download data

Check integrity of data

  1. Extract FASTA header using grep/zgrep "^>" to check that everything downloaded
  2. Compare CHECKSUM values of remote vs local copies
  3. If CHECKSUM values do not match, use diff/zdiff to find the difference (see the sketch below)
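
A sketch of those three steps on a hypothetical gzipped FASTA download (all filenames are placeholders):

```bash
# 1. Count FASTA headers without decompressing, to confirm the records arrived
zgrep -c "^>" genome.fa.gz

# 2. Compute the local checksum and compare it to the value listed on the server
shasum genome.fa.gz

# 3. If the checksums disagree, locate the differences between two gzipped copies
zdiff genome.fa.gz genome_server_copy.fa.gz
```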

Jeremyfishb commented 5 years ago

To download data, use curl or wget, depending on the file types. To ensure integrity, use a checksum and compare the downloaded file's checksum with the source file's checksum. If they don't match, use diff to find what the difference is.

kcribari commented 5 years ago

You can download data from the command line in one of two ways: the first is wget, and the second is curl. wget downloads your data from the command line and puts it in your current directory; it is useful for HTTP and FTP files. curl is similar, but it writes the file to standard output.

The integrity of the data is checked using checksums. These are short digests computed from the data that can tell you if something in the file has been changed. The two most common checksum algorithms are SHA and MD5; SHA-1 is a newer checksum algorithm than MD5.

jgardn92 commented 5 years ago

I would download data from HTTP and FTP sources using either wget or curl, and over SFTP or SCP using only curl. wget downloads data to the current directory, while curl writes data to standard output, so you usually redirect it. To check data integrity I would use a checksum algorithm: either SHA-1 using the program shasum or MD5 using the program md5sum. This will tell me if my downloaded file differs from the original data. If they don't match, the diff command will show me the lines that differ between the files (plus however many lines above and below the difference that I ask it to show).

hgloiselle commented 5 years ago

For downloading data I would use either wget or curl. In order to check integrity of the data I would use shasum or md5sum. Since this only lets the user know if they differ, I would then use diff to figure out how the files differ.

zscooper commented 5 years ago

wget and curl are two common methods for downloading data via the command line from HTTP and FTP addresses. These both work well for moderately sized files over non-secure connections. rsync and scp are also options for secure file downloads for larger files. Rsync is well-equipped for large data transfers, and can be used to check that all of the contents of a directory have been transferred.

shasum and md5sum can be used to check for differences between files. This can also be automated using shasum's check option (-c) that will report which files are different, so each hexadecimal code produced doesn't need to be visually inspected. To determine how files differ and not just if they differ, diff can be used to give the locations of differences between files.
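
A minimal sketch of the -c workflow, assuming the server publishes a checksum manifest alongside the data (the manifest name and URL are made up):

```bash
# Fetch the SHA-1 manifest published with the data files
wget https://example.com/data/checksums.sha

# After downloading the data into the same directory, verify every file
# listed in the manifest; shasum reports OK or FAILED per file
shasum -c checksums.sha
```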

wsano16 commented 5 years ago

If dealing with relatively small files, the best commands for downloading data are curl and wget. The utility of one command over the other depends on your security requirements and the file structure you are trying to download from.

Larger files, such as WGS datasets, are better downloaded via rsync, which only downloads the difference between a source file and the specified destination. This ensures faster download speeds, and a second rsync can be used as a quick checkpoint to make sure that a download proceeded as expected.
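
A sketch of that rsync pattern (the host, paths, and flags chosen here are illustrative, not a prescribed recipe):

```bash
# First pass: copy the remote directory over SSH
# -a preserves attributes, -v lists files, -z compresses in transit
rsync -avz user@remote.host:/data/project/ ./project/

# Second pass as a checkpoint: --dry-run transfers nothing, and -c compares
# checksums instead of timestamps/sizes; an empty file list means the
# download proceeded as expected
rsync -avzc --dry-run user@remote.host:/data/project/ ./project/
```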

After a large download, I would use the checksum commands shasum or md5sum to compare the hexadecimal checksums of my downloaded files to the originals. Any difference between these sums indicates a download error that I can investigate using diff.

calderatta commented 5 years ago

We can use both wget and curl to download large datasets. These work for FTP and HTTP files, although curl can also be used for SFTP and SCP protocols. wget is better suited for quick downloads and recursive downloads.

We can check the integrity of the downloaded file by using the SHA (shasum) and MD5 (md5sum) checksum functions, which produce hexadecimal digests of the original and downloaded files that can be compared to see if they differ. To investigate differences, we can use diff to find what the difference is.

melodysyue commented 5 years ago

Download big data

laurahspencer commented 5 years ago

I would probably use curl, then a checksum command to confirm the file's integrity.