sr320 / course-fish546-2018


Getting Data #13

Closed sr320 closed 5 years ago

sr320 commented 5 years ago

How would you download a large data set at the command line and ensure the integrity of the data was maintained (i.e. the file you downloaded is the exact same as on the server)?

magobu commented 5 years ago

For downloading big data, I could use two command-line programs, wget or curl. For quick downloads of HTTP or FTP files I would use wget; if I wanted to download files using secure protocols (SFTP or SCP), I would use curl. Also, if it were to my advantage to have the downloaded file written to standard output (not quite sure what this means?), I would use curl. There are other advantages to using curl: it can follow page redirects, and it is also available as a library.
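
A minimal sketch of both tools on a made-up URL (the host, filenames, and whether this curl build supports SFTP are assumptions):

```bash
# Quick HTTP/FTP download with wget; the file lands in the current directory
wget https://example.com/data/reads.fastq.gz

# curl writes to standard output by default, so redirect it or use -O
curl https://example.com/data/reads.fastq.gz > reads.fastq.gz
curl -O https://example.com/data/reads.fastq.gz   # keep the remote filename

# Secure download over SFTP (requires a curl build with SFTP support);
# -u with only a username makes curl prompt for the password
curl -u username -O sftp://example.com/data/reads.fastq.gz
```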

To check data integrity, I would use either of the two checksum algorithms, SHA-1 or MD5, to check whether the downloaded files differ from the originals. To find the specific differences among files I would use the Unix tool diff.
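
A short sketch of that workflow, with placeholder filenames; the idea is to compare the locally computed digest to the one published on the server:

```bash
# SHA-1 digest of the downloaded file (shasum defaults to SHA-1)
shasum reads.fastq.gz

# MD5 digest instead, if the server publishes MD5 values
md5sum reads.fastq.gz

# If the digests disagree and the file is plain text, diff shows
# which lines differ between two copies
diff local_copy.fasta server_copy.fasta
```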

grace-ac commented 5 years ago

To download large data, I would use wget or curl, depending on how the files are served. If they are available over HTTP or FTP, I'd use wget. For other protocols, I'd use curl.

If my data download finished without any error notifications, I would check data integrity using a checksum such as SHA-1. It compares the downloaded file to the file you sourced it from: if there is any difference between the two files, the SHA-1 output will be completely different for each. You want the outputs to match, so you know you downloaded the data in its entirety with no mistakes. Mistakes can happen when downloading large data sets because they take a long time, which increases the risk of dropped network connections during the process, which could result in data loss.

yaaminiv commented 5 years ago

Download data: For HTTP or FTP links, I could use wget or curl. curl can also be used to securely download files using SFTP and SCP protocols.

Ensure integrity was maintained: Use checksums (either shasum or md5sum). If the checksum for the downloaded file is different than it should be, the file could have been corrupted during the download.

kimh11 commented 5 years ago

Download data

Check integrity of data

  1. Extract FASTA header using grep/zgrep "^>" to check that everything downloaded
  2. Compare CHECKSUM values of remote vs local copies
  3. If CHECKSUM values do not match, use diff/zdiff to find the difference (see the sketch below)
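
A sketch of those three steps on a hypothetical gzipped FASTA download (all filenames are placeholders):

```bash
# 1. Count FASTA headers without decompressing, to confirm the records arrived
zgrep -c "^>" genome.fa.gz

# 2. Compute the local checksum and compare it to the value listed on the server
shasum genome.fa.gz

# 3. If the checksums disagree, locate the differences between two gzipped copies
zdiff genome.fa.gz genome_server_copy.fa.gz
```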

Jeremyfishb commented 5 years ago

To download data, use curl or wget, depending on the file types. To ensure integrity, use a checksum and compare the downloaded file's checksum with the source file's checksum. If they don't match, use diff to find what the difference is.

kcribari commented 5 years ago

You can download data from the command line in one of two ways: the first is wget, and the second is curl. wget downloads your data from the command line and puts it in your current directory; it is useful for HTTP and FTP files. curl is similar, but it writes the file to standard output.

The integrity of the data is checked using checksums. These are short digests computed from the data that can tell you if something in the file has been changed. The two most common checksum algorithms are SHA and MD5; SHA-1 is a newer checksum algorithm than MD5.

jgardn92 commented 5 years ago

I would download data from HTTP and FTP sources using either wget or curl, and over SFTP or SCP using only curl. wget downloads data to the current directory, while curl writes data to standard output, so you usually redirect it. To check data integrity I would use a checksum algorithm: either SHA-1 using the program shasum or MD5 using the program md5sum. This will tell me if my downloaded file differs from the original data. If they don't match, the diff command will show me the lines that differ between the files (plus however many lines above and below the difference that I ask it to show).

hgloiselle commented 5 years ago

For downloading data I would use either wget or curl. In order to check integrity of the data I would use shasum or md5sum. Since this only lets the user know if they differ, I would then use diff to figure out how the files differ.

zscooper commented 5 years ago

wget and curl are two common methods for downloading data via the command line from HTTP and FTP addresses. These both work well for moderately sized files over non-secure connections. rsync and scp are also options for secure file downloads for larger files. Rsync is well-equipped for large data transfers, and can be used to check that all of the contents of a directory have been transferred.

shasum and md5sum can be used to check for differences between files. This can also be automated using shasum's check option (-c) that will report which files are different, so each hexadecimal code produced doesn't need to be visually inspected. To determine how files differ and not just if they differ, diff can be used to give the locations of differences between files.
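
A minimal sketch of the -c workflow, assuming the server publishes a checksum manifest alongside the data (the manifest name and URL are made up):

```bash
# Fetch the SHA-1 manifest published with the data files
wget https://example.com/data/checksums.sha

# After downloading the data into the same directory, verify every file
# listed in the manifest; shasum reports OK or FAILED per file
shasum -c checksums.sha
```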

wsano16 commented 5 years ago

If dealing with relatively small files, the best commands for downloading data are curl and wget. The utility of one command over the other depends on your security requirements and the file structure you are trying to download from.

Larger files, such as WGS datasets, are better downloaded via rsync, which only downloads the difference between a source file and the specified destination. This ensures faster download speeds, and a second rsync can be used as a quick checkpoint to make sure that a download proceeded as expected.
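
A sketch of that rsync pattern (the host, paths, and flags chosen here are illustrative, not a prescribed recipe):

```bash
# First pass: copy the remote directory over SSH
# -a preserves attributes, -v lists files, -z compresses in transit
rsync -avz user@remote.host:/data/project/ ./project/

# Second pass as a checkpoint: --dry-run transfers nothing, and -c compares
# checksums instead of timestamps/sizes; an empty file list means the
# download proceeded as expected
rsync -avzc --dry-run user@remote.host:/data/project/ ./project/
```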

After a large download, I would use the checksum commands shasum or md5sum to compare the hexadecimal checksums of my downloaded files to the originals. Any difference between these sums indicates a download error that I can investigate using diff.

calderatta commented 5 years ago

We can use both wget and curl to download large datasets. These work for FTP and HTTP files, although curl can also be used for SFTP and SCP protocols. wget is better suited for quick downloads and recursive downloads.

We can check the integrity of the downloaded file by using the SHA (shasum) and MD5 (md5sum) checksum functions, which produce hexadecimal digests of the original and downloaded files that can be compared to see if they differ. To investigate differences, we can use diff to find what the difference is.

melodysyue commented 5 years ago

Download big data

laurahspencer commented 5 years ago

I would probably use curl, then a checksum command to confirm the file's integrity.