Closed philippkraft closed 4 months ago
xyz.conf
...
sample_parser = xyz-parser.yml
...
xyz-parser.yml, for a sample pattern like SiteName-123121-TOC
site:
pattern: ^(\w+?)-
output: \1
date:
pattern: -([0-9\.]+?)-
output: \1
format: %d%m%y
site-map:
SiteName: 87
OtherSite: 93
Analysis carried out in a lab usually result in Excel files with a somewhat encoded sample name with multiple values per sample. If multiple values for the same dataset and the same time exist, this method is building the mean value
The samples in one batch can origin from multiple sites. Such files can be described with .labimport
file. This import
is ment for rather small datafiles, like a couple of hundred rows, as it is rather slow.
The .labimport
file consists of two main parts: The decription of the file format eg. how to interprete the file as a table
and the decription of the meaning of each column. For each value in the table a fitting dataset must exist. Creating
a .labimport
file is difficult and should be mentored by a person with programming experience.
The lab import uses the file reading capacity of the Python library pandas.
The keyword driver is the pandas function to read the tabular data, and driver-options are the keywords of that function.
If you know how to load tables in R
- this is quite similar.
Reads excel files (.xlsx, .xlsb, .xls). Typical keywords are sheet_name
and index_col
. If your data is in the only
worksheet and all data is well formatted, you do not need any options. If you have multiple header lines use skiprows
to skip those.
Reads tabular text files with a seperator. Default is ,
as the variable seperator and .
as the
decimal sign. This can be changed with sep: ";"
and decimal: ","
, the quotes are important here.
If you have date/time column, make sure to use the date parsing options of read_csv with the keywords date_format
and parse_dates, as described in the link. If the date of the sample can be inferred from the sample name, this is not necessary.
skiprows
, nrow
, usecols
and skipfooter
are keywords that can be used in every table format to ignore
unnecessary data. na_values
is used to name conventions of N/A-Values in the file. If the file does not have column headers,
or the column headers must be skipped, please indicate it with headers: null
or give each column its own name
Each column is addressed by its name (or in few cases postion). The columns are treated differently by its type.
This is the standard type, and there can be multiple value columns in a file. Value columns contain the value and the valuetype (as id number, eg 1 for Air Temperature in °C). To scale values for measurements for units a factor can be given (optional).
If a column with the time of sampling is given, that column is marked as type: time
. A well formatted excel column
is directly understood as as a date/time field, csv files often needs a translation with the file format (see above).
Do not include a time column, if the time should be inferred from the sample name.
A column with the target dataset id. The dataset is often inferred from additional information (site, level, time, valuetype), and then no dataset column is given. The additional information can be given in explicit columns or can be encoded in the sample name. If the dataset is also encoded in the sample, this column is taken first
A column with the site of sampling. If the dataset is given or if the site is inferred from the sample, do not specify a site column
A column indicating the level (eg. depth) of a sampling. If the dataset is given or the level is encoded in the sample, this is not necessary
This is a very complex type and can be used to derive metadata like site, time, depth from a sample name (this is called parsing).
If additional columns for this information exists already, use the type samplename
, then no parsing takes place and the
name is only copied. This is not easy - if you are no programmer, you might need help from someone with programming kno
The sample column contains a regex-pattern to describe the content of the sample name. Regular expression (regex) is a mini language to find parts in a text. Groups are in parenthesis and numbers, alphanumeric characters etc can be used to identifiy the parts. You can test a regex beforehand here: https://regex101.com/. The pattern in the example file
(\w+?)_([0-9\.]+_[0-9]+\:[0-9]+)_?([-+]?[0-9\.]+)?
means there are three groups, each seperated with underscore
1) (\w+?) 2) ([0-9.]+_[0-9]+\:[0-9]+) 3) ([-+]?[0-9.]+)?
This pattern matches sample names like F1_4.12.2023_12:54_60 and can mean a sample take at site F1 in December 4th 2023 at 12:54 in a depth of 60 cm.
To derive the site, date and level, the parts must be translated. The date can be translated by the group number (in the example 2) and the date format, in the example %d.%m.%Y_%H:%M using Pythons notation of date formats (which is used by a number of other programming languages). The site needs also the right group (in the example 1), and you can add a map to translate site names into site id's of the database. If your data uses already the official site id's, that map can be left out.
The level is derived from group 3 in our example and can be transformed with a factor. In the example the level 60 for 60cm depth, we use the factor -0.01 to translate it as depth (negative number) and in m. '60' -> -0.6.
If additional markers are needed, please post an issue in github (https://github.com/jlu-ilr-hydro/odmf/issues)
If the samples have been diluted, the row-factor in this column can be used to scale all values in that row.
driver: read_excel # pandas function to read the table. See: https://pandas.pydata.org/docs/reference/io.html
driver-options: # keyword args for the pandas function. See: https://pandas.pydata.org/docs/reference/io.html
sheet_name: Table 1 # Sheet name
columns: # Description of each column, use the column name as object name
Sample: # Name of the sample column, in this case Sample. You can also
type: sample # Necessary to understand column as sample names
pattern: (\w+?)_([0-9\.]+_[0-9]+\:[0-9]+)_?(?:[-+]?[0-9\.]+)
site: # site part of the sample
group: 1 # regex group
map: # a map to translate site names to site id's
F1: 137
F2: 147
F3: 201
B1: 123
B2: 138
B3: 203
date: # the date part of the regex
group: 2 # the regex group
format: "%d.%m.%y_%H:%M"
level:
group: 3
factor: -0.01
N_NO3: # The name
type: value
factor: 1.0
valuetype: 3
N_NH4:
type: value
factor: 14.3
valuetype: 4
Seems to work, please reopen when needed
Replaces #9
Use an additional file to parse sample names into dates / sites