plazi / BLR-website

fix data source for collectionCode facet #12

Open tcatapano opened 4 years ago

tcatapano commented 4 years ago

For treatment facets, get single rather than aggregated collection codes from treatmentBank data

punkish commented 4 years ago

I am going to tackle this, but it might require surgery at the source. As I've mentioned before, my XML parser simply grabs the attributes as they are. In this case, the collection codes arrive as a comma-separated list, so that is how they are inserted. To insert one collection code per row, I have to not only split the string into individual tokens but also de-duplicate the codes after the entire corpus has been processed. This will require modifying the data conversion program, so it will take longer.

                     ┌────────────────────┐                                                         
                     │ materialsCitations │                                                         
┌─────────────┐      ├────────────────────┤      ┌────────────────────┐       ┌────────────────────┐
│ treatments  │      │materialsCitationId │──────▶materialsCitationId │       │  collectionCodes   │
├─────────────┤      ├────────────────────┤      ├────────────────────┤       ├────────────────────┤
│ treatmentId ├──────▶    treatmentId     │      │  collectionCodeId  ◀───────│  collectionCodeId  │
├─────────────┤      ├────────────────────┤      └────────────────────┘       ├────────────────────┤
│      …      │      │         …          │                                   │   collectionCode   │
└─────────────┘      └────────────────────┘                                   └────────────────────┘
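
A minimal sketch of the split-and-de-duplicate step described above, producing rows for the two right-hand tables in the diagram. This is not the actual data conversion program: the function names `splitCollectionCodes` and `normalizeCollectionCodes`, the integer `collectionCodeId` values, and the sample record IDs are all assumptions made for illustration.

```javascript
// Hypothetical helper: split a raw collectionCode attribute into clean tokens.
function splitCollectionCodes(raw) {
  return (raw || '')
    .split(',')
    .map((token) => token.trim())
    .filter((token) => token.length > 0);
}

// Hypothetical helper: walk the whole corpus once, assigning one id per
// distinct code and emitting one junction row per (citation, code) pair.
function normalizeCollectionCodes(materialsCitations) {
  const codeIds = new Map();   // collectionCode -> collectionCodeId
  const links = [];            // rows for the junction table
  const seenLinks = new Set(); // guards against repeated pairs

  for (const mc of materialsCitations) {
    for (const code of splitCollectionCodes(mc.collectionCode)) {
      if (!codeIds.has(code)) {
        codeIds.set(code, codeIds.size + 1); // assumed integer ids
      }
      const collectionCodeId = codeIds.get(code);
      const key = `${mc.materialsCitationId}:${collectionCodeId}`;
      if (!seenLinks.has(key)) {
        seenLinks.add(key);
        links.push({ materialsCitationId: mc.materialsCitationId, collectionCodeId });
      }
    }
  }

  const collectionCodes = [...codeIds].map(([collectionCode, collectionCodeId]) => ({
    collectionCodeId,
    collectionCode,
  }));
  return { collectionCodes, links };
}

// Made-up records; only the comma-separated shape of collectionCode matters.
const sample = [
  { materialsCitationId: 'MC1', collectionCode: 'HUNAU, KSU' },
  { materialsCitationId: 'MC2', collectionCode: 'KSU' },
  { materialsCitationId: 'MC3', collectionCode: '' },
];
const result = normalizeCollectionCodes(sample);
console.log(result.collectionCodes);
console.log(result.links);
```

With these sample records, the sketch yields two distinct codes and three junction rows. The de-duplication state (`codeIds`) has to live across the whole run, which is why this cannot be done record by record inside the parser.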

punkish commented 4 years ago

So, as I said above, I extract the collectionCode values from the treatments exactly as they are provided, and they are provided as a comma-separated list. For example:

{ materialsCitationId: '3B463C91FFF5FF944BEFFA4FFF53F9AF',
  treatmentId: '038787DAFFF7FF904BBFF925FD13F9AA',
  collectingDate: '2018-08-22',
  collectionCode: 'HUNAU, KSU',
  collectorName: 'Rearing & I. Ohshima & C. Q. Liao',
  country: 'China',
  collectingRegion: '',
  municipality: 'Huameiguan',

I think I should not be parsing them (though I could, albeit not very easily). In fact, like the status, this too is something that should be fixed at the source, either when GGI is used to do the tagging, or when a treatment is created and stored. See the comment by @myrmoteras in this regard.
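
To illustrate what is at stake for the facet, here is a rough sketch comparing counts over the aggregated value with counts over single codes. The `facetCounts` function and the two sample rows are hypothetical, and counting per row is an assumption; the real facet query may group differently.

```javascript
// Hypothetical facet counter: counts rows per collection code value.
// With splitCodes = false the raw attribute is used as-is (aggregated);
// with splitCodes = true each comma-separated token is counted on its own.
function facetCounts(materialsCitations, splitCodes) {
  const counts = new Map();
  for (const mc of materialsCitations) {
    const values = splitCodes
      ? mc.collectionCode.split(',').map((s) => s.trim()).filter(Boolean)
      : [mc.collectionCode];
    // A Set ensures a code repeated within one row is counted once.
    for (const value of new Set(values)) {
      counts.set(value, (counts.get(value) || 0) + 1);
    }
  }
  return Object.fromEntries(counts);
}

// Made-up rows using the comma-separated shape shown in the example above.
const rows = [
  { collectionCode: 'HUNAU, KSU' },
  { collectionCode: 'KSU' },
];
console.log(facetCounts(rows, false)); // { 'HUNAU, KSU': 1, KSU: 1 }
console.log(facetCounts(rows, true));  // { HUNAU: 1, KSU: 2 }
```

In the aggregated case, 'HUNAU, KSU' shows up as its own facet value and KSU is undercounted, which is the behavior the issue asks to fix.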