sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0
1 stars 0 forks source link

fix post harvest tests #471

Closed jacobthill closed 5 months ago

jacobthill commented 6 months ago

The post-harvest code and tests broke after we switched to json; they were only operating on the csv files. This should fix that issue.

jacobthill commented 5 months ago

The codeclimate failure here is related to the remove_*_non_relevant methods. There are 3 (I believe) instances of essentially identical code. This should be moved into a single remove_non_relevant method (perhaps in the read_df utility) in order to clear up the code duplication here.

I saw that and started to work through it but I got confused on how to implement it. We pass the name of the post_harvest function in the catalog e.g. remove_walters_non_relevant then we import it in the post_harvest.py task

from dlme_airflow.utils.walters import (
    remove_walters_non_relevant,
)

If we have one function that takes the catalog kwargs and its called remove_non_relevant then we import it in the post_harvest.py task

from dlme_airflow.utils.read_df import (
    remove_non_relevant,
)

how do the various filtering functions get passed? I get that we can pass a function as an argument to remove_non_relevant but I'm confused on how to implement all of this. I think I will need help.

jacobthill commented 5 months ago

The codeclimate failure here is related to the remove_*_non_relevant methods. There are 3 (I believe) instances of essentially identical code. This should be moved into a single remove_non_relevant method (perhaps in the read_df utility) in order to clear up the code duplication here.

I saw that and started to work through it but I got confused on how to implement it. We pass the name of the post_harvest function in the catalog e.g. remove_walters_non_relevant then we import it in the post_harvest.py task

from dlme_airflow.utils.walters import (
    remove_walters_non_relevant,
)

If we have one function that takes the catalog kwargs and its called remove_non_relevant then we import it in the post_harvest.py task

from dlme_airflow.utils.read_df import (
    remove_non_relevant,
)

how do the various filtering functions get passed? I get that we can pass a function as an argument to remove_non_relevant but I'm confused on how to implement all of this. I think I will need help.

I guess the filter function that gets passed to remove_non_relevant would also have to be in the catalog so it gets passed with kwargs?

aaron-collier commented 5 months ago

@jacobthill you would: 1 - Consolidate the *_non_relavant code into one method 2 - Update the includes in https://github.com/sul-dlss/dlme-airflow/blob/main/dlme_airflow/tasks/post_harvest.py#L20 to remove the individual calls in favor of the single method 3 - update the catalog configs with the relevant method name (i.e. change https://github.com/sul-dlss/dlme-airflow/blob/main/catalogs/walters.yaml#L11 from remove_walters_non_relevant to remove_non_relevant ) because we call the method based on the config (https://github.com/sul-dlss/dlme-airflow/blob/main/dlme_airflow/tasks/post_harvest.py#L46)

That should do it I think.

jacobthill commented 5 months ago

@jacobthill you would: 1 - Consolidate the *_non_relavant code into one method 2 - Update the includes in https://github.com/sul-dlss/dlme-airflow/blob/main/dlme_airflow/tasks/post_harvest.py#L20 to remove the individual calls in favor of the single method 3 - update the catalog configs with the relevant method name (i.e. change https://github.com/sul-dlss/dlme-airflow/blob/main/catalogs/walters.yaml#L11 from remove_walters_non_relevant to remove_non_relevant ) because we call the method based on the config (https://github.com/sul-dlss/dlme-airflow/blob/main/dlme_airflow/tasks/post_harvest.py#L46)

That should do it I think.

What about the filter_df method that is called in remove_walters_non_relevant etc.? That logic is different in each post harvest task so remove_non_relevant would need to be passed a function as an argument. I assume that needs to go in the catalog so that its accessible via kwargs?

jacobthill commented 5 months ago

This pattern is similar in walters, qnl, yale, and princeton. Each of them has a helper function that would need to be passed to remove_non_relevant.