reduce package size; be selective with columns retained

rBatt commented 8 years ago

Need to make the package more lightweight by reducing the size of the associated data sets.

One step here is to drop extra columns where possible.

There are 3 basic approaches I'm going to take:

1. drop columns that I know are redundant; e.g., some regions store date and time separately, and I can drop both after I create datetime
2. drop columns that I added and now think aren't useful (some of the taxonomic stuff)
3. drop columns that come from the data providers but that I don't think are very useful (e.g., CATCHJOIN or ALTERATIONDESC)

That 3rd category is tricky, because it represents a loss of information relative to what is provided in raw data. That's what I want feedback on in this Issue: which columns from the raw data do I need to keep?
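As a rough sketch of the first and third kinds of drop (not the package's actual code), the following base R snippet builds datetime from separate date and time columns and then removes the columns that are no longer wanted. The data frame `X`, its values, and the `drop_cols` vector are hypothetical and only illustrate the idea; CATCHJOIN is used because it is one of the provider columns named above.

```r
# Hypothetical haul-level data; column values are illustrative only.
X <- data.frame(
  haulid = c("h1", "h2"),
  date = c("2001-06-15", "2001-06-16"),
  time = c("08:30:00", "14:05:00"),
  CATCHJOIN = c(101L, 102L),
  stringsAsFactors = FALSE
)

# Combine the separate date and time columns into a single POSIXct column.
X$datetime <- as.POSIXct(
  paste(X$date, X$time),
  format = "%Y-%m-%d %H:%M:%S",
  tz = "UTC"
)

# Drop the now-redundant columns, plus a provider column judged not useful.
drop_cols <- c("date", "time", "CATCHJOIN")
X <- X[, setdiff(names(X), drop_cols), drop = FALSE]
```

The same pattern should carry over to a data.table, although the subsetting syntax would differ.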

Below I'll make a list of columns, organized under a few categories. I'm open to any feedback as to which columns are needed, and I can add more options if something is suggested that I don't have. A checked box indicates that I intend to keep the column. The goal is to have the package contain only 1 data set per region, with the raw data available by download (possibly via a package function). In other words, if a column isn't included here, it won't be easily accessible elsewhere.

Most of the following columns have the same name in all regions, or a close equivalent in the regions that have them. If you edit this list to add a column that is needed for only one region (and not for other regions, even if the column exists there), please specify which region.

Time and Location of Sample

[x] reg
[x] year
[ ] season
[x] datetime
[x] lon
[x] lat
[x] stratum (the region's definition, not my custom definition)
[x] haulid

Species ID and Characteristics

[x] spp
[x] common
[x] sex
[x] taxLvl
[x] trophicLevel
[x] trophicLevel.se

Additional Method Metadata

[ ] station
[ ] cruise
[ ] vessel
[ ] towduration
[ ] towarea
[ ] gearsize
[ ] geartype
[ ] comments
[ ] survey (e.g., summer groundfish)

Environmental and Sample Data

[ ] effort
[ ] stratumarea
[x] btemp
[x] stemp
[x] depth
[ ] bsalin
[ ] ssalin
[ ] bdo
[ ] sdo
[ ] wind
[ ] wave
[ ] pressure

Biological Measurements

[ ] cnt
[ ] weight
[ ] length
[x] cntcpue
[x] wtcpue
[ ] NUMLEN (neus only)

Other

[x] keep.row
[x] row_flag

Many of the columns have values that don't change from row to row. In particular, many of the metadata columns don't vary within a haul, and the species taxonomy columns don't change at all (across species or regions). Just as we save all the species taxonomy (etc.) information in the spp.key data.table, we could save much of the haul- or cruise-specific information in separate data.tables. In fact, many of the raw data sets arrive in such a format, with environmental, survey, and biological data separated. While this makes the data less convenient to access, it lets us provide more information while staying under CRAN size limits. So there is definitely room to compromise.
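A minimal sketch of that split, assuming haulid identifies a haul and that btemp and depth are constant within a haul. The data, the `haul_cols` vector, and the `haul_key` and `catch` table names are hypothetical; base R data frames are used here for simplicity, but the same idea applies to data.tables.

```r
# Hypothetical catch records: haul-level columns repeat on every species row.
X <- data.frame(
  haulid = c("h1", "h1", "h2", "h2", "h2"),
  spp = c("a", "b", "a", "b", "c"),
  wtcpue = c(1.2, 0.4, 3.1, 0.9, 2.2),
  btemp = c(7.5, 7.5, 9.1, 9.1, 9.1),
  depth = c(120, 120, 85, 85, 85),
  stringsAsFactors = FALSE
)

# Columns assumed not to vary within a haul.
haul_cols <- c("btemp", "depth")

# One row per haul, holding the haul-level values.
haul_key <- unique(X[, c("haulid", haul_cols)])

# The catch table keeps only the haul identifier and row-specific columns.
catch <- X[, setdiff(names(X), haul_cols)]

# Rebuild the wide table on demand by joining on haulid.
rebuilt <- merge(catch, haul_key, by = "haulid")
```

The cost is the extra join whenever someone wants the haul-level values next to the catch data, which is the convenience trade-off mentioned above.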

rBatt commented 8 years ago

I just experimented briefly with subsetting columns. There may not be much benefit to dropping columns that are highly repeated, because the .RData compression (I'm using "xz") can probably compress most of that repeated information quite well.
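One way to check this kind of thing is to save the same object with and without a repeated column and compare file sizes. The sketch below uses made-up data and a hypothetical helper, `saved_size()`; it is not the test I actually ran, and results on the real data sets may differ.

```r
set.seed(1)
n <- 20000

# Hypothetical data: one noisy numeric column, one highly repeated column.
X <- data.frame(
  wtcpue = runif(n),
  vessel = rep(c("A", "B"), each = n / 2),
  stringsAsFactors = FALSE
)

# Save an object to a temporary .RData file with xz compression and
# return the size of that file in bytes.
saved_size <- function(obj) {
  f <- tempfile(fileext = ".RData")
  on.exit(unlink(f))
  save(obj, file = f, compress = "xz")
  file.size(f)
}

size_with <- saved_size(X)
size_without <- saved_size(X[, "wtcpue", drop = FALSE])

# Bytes on disk attributable to the repeated column.
size_with - size_without
```

If the difference is small relative to `size_without`, dropping that column buys little on disk, even though it may still reduce the in-memory size of the object.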

Something to keep in mind.

rBatt commented 8 years ago

We are just going to move the data files outside of the package, so this no longer needs to be done.

rBatt / trawlData

reduce package size; be selective with columns retained #64