pbs-assess / data-stewardship-unit

0 stars 0 forks source link

Publish a PROOF OF CONCEPT dashboard/report of common queries made to the Fishery Observing System using Fabric and Power BI #15

Closed Br-Johnson closed 3 days ago

Br-Johnson commented 1 month ago

Background Info (Notes from initial meeting with Joey, Jason, and Brett):

Jason: Most common data request is an in season commercial in season catch estimates. Data are published to open data. Broken up by gear type. Internal data. If less than 3 vessels are out then the data are restricted because people could ID fisherman if there's only 3 vessels out. Data are provided by calendar year. 2005-2023 is available.

In season estimates are produced by fishery managers, who produce these estimates based on a number of different data sources (including mainly Fishery reported data, but also fly overs etc.) Based on in season estimates, not log book reconciled after season. End of season report data available on Open Government Portal from 1994-2004.

Active Fishing Vessels by Calendar Year Query

Commercial Openings Query (Constitute 25% of his total data requests, takes half an hour each time)

Commercial Catch Query

In summary

Plan is to take all raw FOS data imported into Pacific Science Sandbox in Fabric and recreate Jason's SQL queries to get a dataset(s) that could support the three queries to build a Power BI Dashboard with that Jason could use to make data requests, and to send to internal stakeholders for self-serve data access.

Need to be very mindful of privacy concerns as the dataset contains Protected B data. Must not be shared externally unless data are rolled up to aggregate vessels and there are more than three vessels.

Data need to be promoted to Silver layer in Fabric, and eventually Gold layer once all data quality checks completed.

Might be nice to have three different views (tabs) of the data in the dashboard that represent Commercial Catch, Commercial Openings, Number of Active Fishing Vessels.

Br-Johnson commented 3 weeks ago

Hey @AnnaD-M Any thoughts on next steps to try and get an MVP out as a demo? A nice endpoint for this project would be to have a few slides describing the steps involved: 1) cloning oracle db into Fabric Lakehouse (Bronze), 2) transforming the data via spark notebook (silver); 3) orchestrating workflow updates; 4) proof of concept FOS Dashboard App for common queries supporting high level decisions, with user-defined grouping and filtering.

I think we should get an MVP out before testing and validating the query and promoting it to a gold level. For that step we will likely need to re-write the transformation notebook (possibly in SQL for easier comparison with Jasons queries and easier maintainability into the future, also could consider using R) and build a star schema as a demo of best practices.

AnnaD-M commented 2 weeks ago

Re: Commercial Catch Query: 'Include ability to select in-season estimates or fisher reported logbook data'

Quick explanation from Jason: Fisher reported catch is data called into a call centre and reported by a fisherman or data entered into a logbook. The in-season estimates are produced by fishery managers. They are based on the fisher reported catch but take other data sources into account. Like overflight data. Validations. Interviews etc. A phone-in report can either be an ELog or a report verbally given to a call centre agent. A logbook report is data transcribed from the hard copy logbook by the service provider post season. The reports should... SHOULD.... be duplicates. BUT. In the event there is a discrepancy the logbook report is considered the true report.

Need to figure out how to resolve this. Currently being double counted in summary table. Need an IF/ELSE statement to retain phone-in if no logbook entry exists? Currently more phone-in records than logbook records...

Image