lemma-osu / synthetic-knn

Two-stage kNN imputation using synthetic data and treelists

Sample dataset with treelist data #2

Open grovduck opened 1 month ago

grovduck commented 1 month ago

We are planning for this package to integrate tightly with other packages from our group, including sknnr and sknnr-spatial. For those projects, we have relied on the southwest Oregon (SWO) ecology data for testing. Unfortunately, the SWO dataset does not have individual tree- (piece-) level information, but instead relies on cover estimates for different species on a plot.

For this project, we explicitly need treelist data for creating synthetic treelists. Treelist data are simply records of individually measured trees on forest inventory plots that capture attributes of those trees (including species, diameter at breast height, age, etc.) along with an expansion factor (typically either trees per acre or trees per hectare) that represents the density of that record on a per-unit-area basis. This field is essential for expanding all other attributes (e.g. biomass, volume, etc.) to the same per-unit-area estimates.
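As a minimal illustration of how the expansion factor works (the column names and values below are made up, not the actual FIADB schema), scaling a per-tree attribute like basal area up to a per-hectare plot estimate looks something like this:

```python
import pandas as pd

# Toy treelist: one record per measured tree, with a per-hectare
# expansion factor ("tpha"); names and values are made up.
treelist = pd.DataFrame({
    "plot_id": [1, 1, 2],
    "species": ["PSME", "TSHE", "PSME"],
    "dbh_cm": [45.0, 30.0, 60.0],
    "tpha": [12.0, 25.0, 6.0],  # trees per hectare represented by this record
})

# Per-tree basal area (m^2) from diameter, then expanded to a
# per-hectare estimate and summed to the plot level.
treelist["ba_m2"] = 3.14159 / 40000 * treelist["dbh_cm"] ** 2
treelist["ba_m2_per_ha"] = treelist["ba_m2"] * treelist["tpha"]
print(treelist.groupby("plot_id")["ba_m2_per_ha"].sum())
```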

The Forest Inventory and Analysis Database (FIADB) is the largest database of forest inventory information in the U.S. and is accessible through the FIA Data Mart. Comprehensive data (including treelist data) is available at the state and national levels in CSV and SQLite formats. These data are the basis for national forest estimates through such tools as EVALIDator and DATIM.

We propose using a full cycle of Oregon FIADB inventory data (2011-2020) on exclusively forested plots (i.e. single-condition plots in a forested condition). The sample dataset would consist of three different tables: an environmental (covariate) matrix, a species matrix, and a treelist.

A number of questions arise when considering the adoption of this approach:

  1. The treelist data (FIADB table TREE) includes well over 100 fields. For synthetic treelist creation we only need a small subset of these fields, but it may be valuable to retain other fields to ensure that code correctly handles extraneous data. However, too many fields can make testing and examples slow. Should we subset the number of fields to a manageable number? The assumption is that we will have roughly 150,000 tree records across this sample.
  2. Tree measurements and expansions in FIADB data are typically stored in imperial units (e.g. diameter in inches, tree density expressed in per-acre values). Traditionally, our workflows have used metric units. Should we first do this conversion and, if so, should we provide pre-processing code as part of this repository so that users could follow a similar workflow with other areas? (A rough sketch of items 1 and 2 follows this list.)
  3. Related to 2, extraction of the covariates at plot locations is a separate analysis workflow that has little to do with the aims of this package. Is it valuable to document and provide code to replicate this process? I don't think we want to include the spatial data as part of this repository, as it is large and of limited use outside the proposed sample dataset.
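To make questions 1 and 2 concrete, the kind of pre-processing I have in mind looks roughly like the sketch below (the field list is just a candidate subset, and the column names should be double-checked against the FIADB TREE table documentation):

```python
import pandas as pd

# Candidate subset of FIADB TREE fields (the full table has 100+ columns);
# field names here should be verified against the FIADB documentation.
TREE_FIELDS = ["CN", "PLT_CN", "SPCD", "DIA", "HT", "TPA_UNADJ", "STATUSCD"]

tree = pd.read_csv("OR_TREE.csv", usecols=TREE_FIELDS)

# Optional imperial-to-metric conversion (question 2): inches -> cm,
# feet -> m, and per-acre expansion factors -> per-hectare.
tree["DIA_CM"] = tree["DIA"] * 2.54
tree["HT_M"] = tree["HT"] * 0.3048
tree["TPH_UNADJ"] = tree["TPA_UNADJ"] / 0.404686  # 1 acre = 0.404686 ha
```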

@aazuspan, curious to hear your thoughts on this approach or whether there is a better alternative for sample/testing data (with treelists) that I haven't thought about yet.

aazuspan commented 1 month ago

@aazuspan, curious to hear your thoughts on this approach or whether there is a better alternative for sample/testing data (with treelists) that I haven't thought about yet.

Sounds like a good plan to me! It's a shame there's no way to lean on the SWO dataset for consistency, but this seems like a reasonable alternative. I suppose if we wanted to maintain one shared dataset across all packages, you could theoretically add the compositional and environmental data as a new dataset in sknnr, a subset of the covariate grids in sknnr-spatial, and just the treelist in synthetic-knn? That seems like a lot of extra work without much practical benefit, though.

Some other thoughts below for you to take or leave!

Should we subset the number of fields to a manageable number? The assumption is that we will have roughly 150,000 tree records across this sample.

Subsetting fields to a representative set of data types to improve performance seems fair to me, with the understanding that your goal is to provide a demonstration dataset rather than accurate, unmodified FIA data. Out of curiosity, would it make sense to use a subset of plots to get the tree count down as well? I suppose that could be done easily enough within tests as well.

Tree measurements and expansions in FIADB data are typically stored in imperial units (e.g. diameter in inches, tree density expressed in per-acre values). Traditionally, our workflows have used metric units. Should we first do this conversion and, if so, should we provide pre-processing code as part of this repository so that users could follow a similar workflow with other areas?

I don't think I would be too concerned with providing scripts for metric conversion, since that's not a required step to use the package. Also, would it potentially be tricky to maintain for different FIA versions? I know the FIA protocols change, but I'm not sure how consistent the overall data structure has been going back in time.

extraction of the covariates at plot locations is a separate analysis workflow that has little to do with the aims of this package. Is it valuable to document and provide code to replicate this process?

I would probably put this in the out-of-scope category as well, but at the same time, more examples and documentation are never a bad thing.

grovduck commented 1 month ago

I suppose if we wanted to maintain one shared dataset across all packages, you could theoretically add the compositional and environmental data as a new dataset in sknnr, a subset of the covariate grids in sknnr-spatial, and just the treelist in synthetic-knn? That seems like a lot of extra work without much practical benefit, though.

This is an intriguing thought. It would be nice in the sense that we could rely on those packages' functions to bring in the data for synthetic-knn. I'm a bit torn in the sense that if we provide notebooks on how to use synthetic-knn and rely on dataset loading functions from sknnr and sknnr-spatial, users may not understand the required workflow to load their own data? But it also seems silly to replicate functionality in sknnr in this package for loading and testing this data.

What do you think about moving some of the functionality in sknnr.datasets._base into a utils type of space (specifically moving _load_dataset_from_csv_filenames into a public function)? Right now, everything is hinging on DATA_MODULE in there (which is specifically set to sknnr.datasets.data), but if we made that an argument and put it into a public API, other folks could use that for getting their data into a similar format? This might be feature creep and there's probably a good reason to avoid this. And that might even be more work than what you initially suggested!

Subsetting fields to a representative set of data types to improve performance seems fair to me, with the understanding that your goal is to provide a demonstration dataset rather than accurate, unmodified FIA data.

Yep, I like this idea. We'll make sure we have a variety of types represented, but otherwise narrow down the list substantially.

Out of curiosity, would it make sense to use a subset of plots to get the tree count down as well? I suppose that could be done easily enough within tests as well.

My only thought for retaining the full set of tree records would be performance testing. But your point is well taken - we wouldn't need the full set to be part of the package's data and could stress test outside the test suite.

Thanks for the feedback on unit conversions and spatial extractions as well. I'll consider those to be out of scope.

aazuspan commented 1 month ago

I'm a bit torn in the sense that if we provide notebooks on how to use synthetic-knn and rely on dataset loading functions from sknnr and sknnr-spatial, users may not understand the required workflow to load their own data?

Interesting! I had assumed there would be something like a synthetic_knn.datasets.load_... function that would load everything for you in a format that could then be used to create synthetic plots. Were you thinking of something more granular where the data files are provided along with notebooks showing how to load them manually? Or am I misunderstanding entirely?

What do you think about moving some of the functionality in sknnr.datasets._base into a utils type of space (specifically moving _load_dataset_from_csv_filenames into a public function)? Right now, everything is hinging on DATA_MODULE in there (which is specifically set to sknnr.datasets.data), but if we made that an argument and put it into a public API, other folks could use that for getting their data into a similar format?

Yeah, I like the idea of exposing things that are general enough to be useful in a public API! I think my only concerns with generalizing _load_csv_data are the specific restrictions like requiring an integer index in the first column, and floats in the remaining columns, but if those work here and we made them clear in the API, I don't see anything wrong with that.
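Roughly what I'm picturing, purely as a sketch (this is not the existing sknnr code, and the function name and signature are illustrative):

```python
from importlib import resources

import pandas as pd


def load_dataset_from_csv(filename: str, data_module: str) -> pd.DataFrame:
    """Load a packaged CSV with an integer index in the first column and
    float values in the remaining columns (illustrative name/signature).

    data_module is the dotted path to the package holding the data files
    (e.g. "sknnr.datasets.data"), passed in rather than hard-coded.
    """
    with resources.files(data_module).joinpath(filename).open() as f:
        df = pd.read_csv(f, index_col=0)
    return df.astype(float)
```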

I'm not sure how big the dataset files you're thinking about here are, but presumably they might require something like pooch to load them on-the-fly?

My only thought for retaining the full set of tree records would be performance testing. But your point is well taken - we wouldn't need the full set to be part of the package's data and could stress test outside the test suite.

I suppose you could also take a similar approach to what we did in sknnr_spatial with a large_dataset=False param to give a variety of sizes? But if things load and run fast enough with the full dataset, maybe that's unnecessary.

grovduck commented 1 month ago

Interesting! I had assumed there would be something like a synthetic_knn.datasets.load_... function that would load everything for you in a format that could then be used to create synthetic plots. Were you thinking of something more granular where the data files are provided along with notebooks showing how to load them manually? Or am I misunderstanding entirely?

I had to think about this one for a while. I agree that it makes the most sense to have a function in a synthetic_knn.datasets module that loads everything, much like sknnr and sknnr_spatial. In those packages, we don't provide much guidance on how we derive those example datasets, but the focus is on using those datasets within the context of the package. I think that approach makes sense here too.

What I have started (and will create an issue/PR for shortly) is a notebook that steps users through downloading data from FIADB and creating the sample data that will reside at synthetic_knn.datasets.data. I'm not convinced that this notebook should be part of this package, but I think it makes sense to have it start here and move it out elsewhere if it is more general purpose.

Yeah, I like the idea of exposing things that are general enough to be useful in a public API! I think my only concerns with generalizing _load_csv_data are the specific restrictions like requiring an integer index in the first column, and floats in the remaining columns, but if those work here and we made them clear in the API, I don't see anything wrong with that.

Those are valid concerns, although I do think they would work in this context. I've opened an issue at lemma-osu/sknnr#68 to track this.

I'm not sure how big the dataset files you're thinking about here are, but presumably they might require something like pooch to load them on-the-fly?

In early experimentation with a full cycle of Oregon plots, the treelist file is 8MB. That seems manageable for storing in the repo, but I'm open to pooch if you think that would make these more future-proof.
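If we did go the pooch route, I'd expect it to be as simple as something like this (the URL is a placeholder and the hash would need to be pinned to a real release asset):

```python
import pooch

# Placeholder URL; known_hash would be pinned to the real file's SHA256
# so pooch can verify (and cache) the download.
treelist_path = pooch.retrieve(
    url="https://github.com/lemma-osu/synthetic-knn/releases/download/v0.1.0/or_treelist.csv",
    known_hash=None,
)
```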

I suppose you could also take a similar approach to what we did in sknnr_spatial with a large_dataset=False param to give a variety of sizes? But if things load and run fast enough with the full dataset, maybe that's unnecessary.

The only consideration here is that we'd need to subset all three files (the environmental matrix, the species matrix, and the treelist file) to have the same plot IDs, but that should be pretty easy to handle. We may have to get to the testing phase to see how slow this is and whether or not it's worth subsetting.
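Something along these lines is all I'm imagining for the subsetting (file and column names are placeholders):

```python
import pandas as pd

env = pd.read_csv("environment.csv", index_col=0)
spp = pd.read_csv("species.csv", index_col=0)
trees = pd.read_csv("treelist.csv")

# Take a smaller set of plot IDs and apply it to all three tables so
# they stay aligned on the same plots.
small_ids = env.index[:500]
env_small = env.loc[small_ids]
spp_small = spp.loc[small_ids]
trees_small = trees[trees["PLOT_ID"].isin(small_ids)]
```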

aazuspan commented 1 month ago

What I have started (and will create an issue/PR for shortly) is a notebook that steps users through downloading data from FIADB and creating the sample data that will reside at synthetic_knn.datasets.data. I'm not convinced that this notebook should be part of this package, but I think it makes sense to have it start here and move it out elsewhere if it is more general purpose.

This sounds like a great addition! As you say, it could be more generally useful outside of synthetic-knn, but will be great to be able to reference here.

In early experimentation with a full cycle of Oregon plots, the treelist file is 8MB. That seems manageable for storing in the repo, but I'm open to pooch if you think that would make these more future-proof.

Cool, I was expecting something larger based on the data from the Gist. I don't know if there's some rule of thumb, but 8MB doesn't seem unreasonable to include in the package.

We may have to get to the testing phase to see how slow this is and whether or not it's worth subsetting.

Agreed, maybe it won't be a problem at all.

grovduck commented 1 month ago

Cool, I was expecting something larger based on the data from the Gist. I don't know if there's some rule of thumb, but 8MB doesn't seem unreasonable to include in the package.

I had cheated a bit on the sample data for the gist, in that I created a master treelist of all forested FIA plots that could be used across all model regions / years. In this workflow, I think it's best to subset down the treelists to just those plots under analysis.

grovduck commented 1 week ago

@aazuspan, you may very well want to kill me given the work to expose the sknnr.dataset functions, but I'm thinking back to this comment you made earlier in the thread:

I suppose if we wanted to maintain one shared dataset across all packages, you could theoretically add the compositional and environmental data as a new dataset in sknnr, a subset of the covariate grids in sknnr-spatial, and just the treelist in synthetic-knn? That seems like a lot of extra work without much practical benefit, though.

I think you were right on here and it's taken me until now to hop on board.

My initial concern was that we would have no way to demonstrate to users how to use their own data. I do think it's still good that we've exposed the dataset functions, because users can use them to load their own data with the constraints those functions enforce, ensuring that it should work correctly in sknnr. We also now have a notebook that guides users on how to preprocess FIA data for use with synthetic-knn. So that allays my concerns about users bringing data to these projects.

But now I'm thinking that separating the FIADB Oregon data across the respective packages is logically more appealing to me, for reasons that were probably obvious to you before.

So I'm ready to face your horrible scorn - let me know what you think and forgive me for being slow on the uptake.

aazuspan commented 1 week ago

Don't worry, no scorn from me! As you said, exposing the dataset loading functions in sknnr was probably worthwhile, even if we don't end up using them here (and you did most of the work there, anyways!). Just to make sure I understand, the broad changes you're proposing to integrate this dataset across the "sknnr-verse" would be:

  1. sknnr: Add a new dataset with the environmental and species matrix from the Oregon FIADB data.
  2. synthetic-knn: Add a corresponding treelist dataset (would this also load the environmental and species matrices via sknnr, or would that be left up to the user?)
  3. sknnr-spatial: No direct changes, as the SWO Ecoplot covariate grids are compatible with the environmental matrix.

Do I have that right?

We considered the use of polars for treelist generation and stand summaries, so that could be an experimental use in this project.

Any thoughts on how to implement this? Something like an as_polars argument in the loading function? I guess if I/O wasn't a significant bottleneck, we could convert it to polars as needed. Might be worth some experimentation to see how much time a pl.scan_csv saves over eagerly reading with pandas and converting.

grovduck commented 1 week ago

Don't worry, no scorn from me!

Phew!

2. synthetic-knn: Add a corresponding treelist dataset (would this also load the environmental and species matrices via sknnr, or would that be left up to the user?)

Good question on the loading of the environmental and species matrices. Technically, once we have created the CCATransformer, we don't need them around anymore, only the treelist data. Having them all together means that sknnr is a dependency of synthetic-knn (at least the dataset feature of it). But in sknnr-spatial, you're returning the matrices, so there is precedent for doing it that way as well. Perhaps an independent loader just for tree data and then a wrapper function for all three (env + spp + tree)? I clearly haven't thought this through all the way.
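Very roughly, something like this is what I'm picturing (all of the function names, paths, and the sknnr loader below are hypothetical at this point):

```python
import polars as pl


def load_treelist() -> pl.DataFrame:
    """Load only the Oregon FIADB treelist (path is a placeholder)."""
    return pl.read_csv("synthetic_knn/datasets/data/or_treelist.csv")


def load_fiadb_oregon():
    """Convenience wrapper returning env + spp matrices (via a hypothetical
    sknnr dataset loader) along with the treelist from this package."""
    from sknnr.datasets import load_oregon_fiadb  # hypothetical loader name

    env, spp = load_oregon_fiadb()  # hypothetical: returns env and spp matrices
    trees = load_treelist()
    return env, spp, trees
```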

3. sknnr-spatial: No direct changes, as the SWO Ecoplot covariate grids are compatible with the environmental matrix.

The only possible change here I see is that we might want to bring LATITUDE and LONGITUDE back as covariates. But I think it's just as easy to remove those covariates from the environment matrix, so that would be my recommended workflow. It means editing the 00_preprocess_data.ipynb notebook, but that should be straightforward.

Any thoughts on how to implement this? Something like an as_polars argument in the loading function? I guess if I/O wasn't a significant bottleneck, we could convert it to polars as needed. Might be worth some experimentation to see how much time a pl.scan_csv saves over eagerly reading with pandas and converting.

Thinking about the work you did with synthetic-knn-censoring-demo, I had assumed that we would only support either pandas or polars for that functionality, but are you thinking about supporting both workflows? Or do you mean that a user could feed that treelist generation function either a pd.DataFrame or a pl.DataFrame and the former would be converted to polars internally?

I think I'm following your logic on pl.scan_csv vs. pd.read_csv + pl.from_pandas(df). If we went with the former in the dataset treelist loading function, would the return value be a pl.LazyFrame which would then be fed into the synthetic treelist generation function? That seems like a cool idea. (Now that I'm looking back at the polars code you wrote, I can see you've used pl.scan_csv here - I didn't catch that at first.)
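Just to make the comparison concrete in my head, I'm thinking of the difference between these two options (the file path and column name are placeholders):

```python
import pandas as pd
import polars as pl

TREELIST = "or_treelist.csv"  # placeholder path

# Option 1: lazy scan; nothing is read until .collect(), and polars can
# push the filter/column selection down into the CSV read.
lazy_trees = pl.scan_csv(TREELIST).filter(pl.col("DIA_CM") > 50.0).collect()

# Option 2: eager pandas read, then convert to polars.
eager_trees = pl.from_pandas(pd.read_csv(TREELIST)).filter(pl.col("DIA_CM") > 50.0)
```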

I think I could definitely use your help with the treelist loading functions, especially if we bring polars into it. Are you comfortable with me putting the FIADB Oregon data into sknnr? I think we've decided that's a good idea.

aazuspan commented 4 days ago

Technically, once we have created the CCATransformer, we don't need them around anymore, only the treelist data. Having them all together means that sknnr is a dependency of synthetic-knn (at least the dataset feature of it).

Would building a CCATransformer be part of the typical synthetic-knn usage like it was in the 01_create_synthetic_plots notebook? If so, then it seems like depending on sknnr makes sense, but if not then I could see wanting to separate them.

Perhaps an independent loader just for tree data and then a wrapper function for all three (env + spp + tree)? I clearly haven't thought this through all the way.

This sounds like a good approach if there's an application for having the treelist without the corresponding env/spp data. I imagine this is how it might be organized internally anyways, with one function dedicated to loading the treelist only.

The only possible change here I see is that we might want to bring LATITUDE and LONGITUDE back as covariates. But I think it's just as easy to remove those covariates from the environment matrix, so that would be my recommended workflow.

Removing the covariates for simplicity seems reasonable to me!

Thinking about the work you did with synthetic-knn-censoring-demo, I had assumed that we would only support either pandas or polars for that functionality, but are you thinking about supporting both workflows? Or do you mean that a user could feed that treelist generation function either a pd.DataFrame or a pl.DataFrame and the former would be converted to polars internally?

Yeah, I was just thinking of the latter, if anything. Maintaining two separate APIs for the two dataframe types and keeping them in sync sounds like a huge headache. Even allowing both pd.DataFrame and pl.DataFrame and converting them might be more effort than it's worth since it would mean we'd need to depend on both... If we feel like we're committed on the polars choice, maybe we just stick with that and tell users how to convert if needed?
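i.e., the guidance to users could be as small as this (sketch only, with made-up data):

```python
import pandas as pd
import polars as pl

# A user with an existing pandas treelist converts it once; everything
# downstream of that stays in polars.
trees_pd = pd.DataFrame({"plot_id": [1, 2], "dbh_cm": [45.0, 30.0]})
trees_pl = pl.from_pandas(trees_pd)
```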

I think I could definitely use your help with the treelist loading functions, especially if we bring polars into it.

Happy to help out however!

Are you comfortable with me putting the FIADB Oregon data into sknnr? I think we've decided that's a good idea.

Definitely! Seems like a good next step.

grovduck commented 4 days ago

Would building a CCATransformer be part of the typical synthetic-knn usage like it was in the 01_create_synthetic_plots notebook? If so, then it seems like depending on sknnr makes sense, but if not then I could see wanting to separate them.

I think I'm getting a bit hung up on technically required versus practically required. For most of the scenarios I can imagine (and at least how we have the example notebooks written now), CCATransformer is definitely part of this workflow. But there's definitely a scenario where a user just has covariates extracted at point locations and wants to [densify | obscure | oversample as in #4] these data using one of the existing PointNetworks. This wouldn't use a transformer and wouldn't be dependent on sknnr.

Just as we've had different dependencies for the dataset feature, would it make sense to have an examples feature and require sknnr just for that feature? In the end, I'm making this a bigger deal than it needs to be given that sknnr isn't a big dependency on its own.

Even allowing both pd.DataFrame and pl.DataFrame and converting them might be more effort than it's worth since it would mean we'd need to depend on both... If we feel like we're committed on the polars choice, maybe we just stick with that and tell users how to convert if needed?

I am fully OK just moving forward with polars as the only option, until we run into some reason why this won't work.

Are you comfortable with me putting the FIADB Oregon data into sknnr? I think we've decided that's a good idea.

Definitely! Seems like a good next step.

Let's start there. I'll open a PR here soon.

aazuspan commented 4 days ago

Just as we've had different dependencies for the dataset feature, would it make sense to have an examples feature and require sknnr just for that feature?

This sounds like a good plan!

I am fully OK just moving forward with polars as the only option, until we run into some reason why this won't work.

Me too!

Let's start there. I'll open a PR here soon.

👍