rBatt / trawlData

Collate and clean bottom trawl survey data

reduce package size; be selective with columns retained #64

Closed rBatt closed 8 years ago

rBatt commented 8 years ago

Need to make the package more lightweight by reducing the size of the associated data sets.

One step here is to drop extra columns where possible.

There are 3 basic approaches I'm going to take:

  1. drop columns that I know are redundant; e.g., some regions store date and time in separate columns, and I can drop those 2 once I create datetime
  2. drop columns that I added and that I think aren't useful (some of the taxonomic stuff)
  3. drop columns that come from the data providers but that I don't think are that useful (e.g., CATCHJOIN or ALTERATIONDESC).
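
As a rough sketch of approach 1, assuming a hypothetical region table with separate `date` and `time` character columns (the toy data, the column names, and the `wtcpue` column are made up for illustration; real regional formats differ, and base R is used here only to keep the example self-contained):

```r
# Hypothetical toy data; real regional column names and formats differ
X <- data.frame(
  date   = c("2010-06-01", "2010-06-02"),
  time   = c("08:15", "13:40"),
  wtcpue = c(1.2, 0.4),
  stringsAsFactors = FALSE
)

# Build a single datetime column from the two separate pieces
X$datetime <- as.POSIXct(
  paste(X$date, X$time),
  format = "%Y-%m-%d %H:%M",
  tz = "UTC"
)

# The separate date and time columns are now redundant, so drop them
X <- X[setdiff(names(X), c("date", "time"))]
```

Nothing is lost by this step, since date and time can both be recovered from `datetime` with `format()`.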

That 3rd category is tricky, because it represents a loss of information relative to what is provided in raw data. That's what I want feedback on in this Issue: which columns from the raw data do I need to keep?

Below is a list of columns, organized under a few categories. I'm open to any feedback on which columns are needed, and I can add more options if something is suggested that I don't have. A checked box indicates that I intend to keep the column. The goal is for the package to contain only 1 data set per region, with raw data available by download (possibly via a package function). In other words, if a column isn't included here, it won't be easily accessible elsewhere.

Most of the following columns will have the same name in all regions, or a close equivalent in the regions that have them. If you edit this list to add a column that is only needed for one particular region (even if the column exists in other regions), please specify which region.

Time and Location of Sample

Species ID and Characteristics

Additional Method Metadata

Environmental and Sample Data

Biological Measurements

Other

Many of the columns don't have values that change on every row. In particular, many of the "meta data" columns don't vary within a haul, and the species taxonomy columns don't change at all for a given species (across hauls or regions). Just like we save all the species taxonomy (etc.) information in the spp.key data.table, we could save much of the haul- or cruise-specific information in separate data.tables. In fact, many of the raw data sets arrive in such a format, where environmental, survey, and biological data are separated. While this makes it less convenient to access the data, it means we can provide more information while staying under CRAN size limits. So there is definitely room to compromise.
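
A minimal sketch of that idea, assuming hypothetical column names (`haulid`, `spp`, `wt`, `depth`, `btemp`) and toy values that are not taken from the package. It uses base data.frames to stay self-contained; the same pattern would apply to data.tables:

```r
# Hypothetical toy data: two hauls, three catch records
X <- data.frame(
  haulid = c("h1", "h1", "h2"),
  spp    = c("Gadus morhua", "Clupea harengus", "Gadus morhua"),
  wt     = c(3.1, 0.8, 2.4),
  depth  = c(55, 55, 120),    # constant within a haul
  btemp  = c(6.2, 6.2, 4.9),  # constant within a haul
  stringsAsFactors = FALSE
)

# Columns assumed not to vary within a haul
haul_cols <- c("depth", "btemp")

# One row per haul, holding the haul-level columns
haul_key <- unique(X[c("haulid", haul_cols)])

# The catch table keeps the haul identifier plus the row-varying columns
catch <- X[setdiff(names(X), haul_cols)]

# Joining on haulid recovers the full set of columns when needed
full <- merge(catch, haul_key, by = "haulid")
```

The trade-off is the one described above: users need an extra join to get the haul-level columns back, but each haul-level value is stored once per haul rather than once per catch record.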

rBatt commented 8 years ago

I just did some quick experimenting with subsetting columns. I think there might not be much benefit to dropping columns whose values are highly repeated, because the .RData compression (I'm using "xz") is probably able to compress most of this repeated information quite well.
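
One way to check this kind of thing is to save the same table with and without the repeated columns and compare file sizes. The sketch below uses made-up data (the `wt`, `region`, and `haulid` columns and the `"ebs"` value are hypothetical), so the ratio it reports says nothing about the real data sets; it only illustrates the comparison:

```r
set.seed(1)
n <- 20000

# Hypothetical table: one column that varies on every row, two that repeat
X <- data.frame(
  wt     = runif(n),                                # varies on every row
  region = rep("ebs", n),                           # identical on every row
  haulid = rep(sprintf("h%04d", 1:200), each = 100) # long runs of repeats
)

# Same table with the repeated columns dropped
X_small <- X["wt"]

f_full  <- tempfile(fileext = ".RData")
f_small <- tempfile(fileext = ".RData")

# Save both versions with xz compression
save(X, file = f_full, compress = "xz")
save(X_small, file = f_small, compress = "xz")

# Compare on-disk sizes; a ratio near 1 means the repeated columns cost little
size_full  <- file.size(f_full)
size_small <- file.size(f_small)
size_ratio <- size_full / size_small
print(size_ratio)
```

Running the same comparison on an actual regional data set would show whether dropping repeated columns is worth the loss of convenience.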

Something to keep in mind.

rBatt commented 8 years ago

We are just going to move the data files outside of the package, so this no longer needs to be done.