impresso / impresso-pycommons

Python module with bits of code (objects, functions) highly reusable within impresso.
http://impresso-pycommons.rtfd.io/
GNU Affero General Public License v3.0

Compute manifest #90

Closed EmanuelaBoros closed 2 months ago

EmanuelaBoros commented 2 months ago

Hello,

I am having trouble generating the manifest.

pip install impresso-pycommons
Defaulting to user installation because normal site-packages is not writeable
Collecting impresso-pycommons
  Downloading impresso_pycommons-0.12.8.tar.gz (36 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: dask[complete] in /home/eboros/.local/lib/python3.11/site-packages (from impresso-pycommons) (2024.5.1)
Requirement already satisfied: distributed in /home/eboros/.local/lib/python3.11/site-packages (from impresso-pycommons) (2024.5.1)
Collecting boto (from impresso-pycommons)
  Downloading boto-2.49.0-py2.py3-none-any.whl.metadata (7.3 kB)
Requirement already satisfied: boto3 in /home/eboros/.conda/envs/myenv/lib/python3.11/site-packages (from impresso-pycommons) (1.34.127)
Collecting bs4 (from impresso-pycommons)
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Requirement already satisfied: docopt in /home/eboros/.local/lib/python3.11/site-packages (from impresso-pycommons) (0.6.2)
Requirement already satisfied: deprecated in /home/eboros/.conda/envs/myenv/lib/python3.11/site-packages (from impresso-pycommons) (1.2.14)
Requirement already satisfied: dkpro-cassis in /home/eboros/.local/lib/python3.11/site-packages (from impresso-pycommons) (0.9.1)
Requirement already satisfied: scikit-build in /home/eboros/.local/lib/python3.11/site-packages (from impresso-pycommons) (0.18.0)
Requirement already satisfied: cmake in /home/eboros/.local/lib/python3.11/site-packages (from impresso-pycommons) (3.30.2)
INFO: pip is looking at multiple versions of impresso-pycommons to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following yanked versions: 3.4.11.39, 3.4.17.61, 4.4.0.42, 4.4.0.44, 4.5.4.58, 4.5.5.62, 4.7.0.68
ERROR: Could not find a version that satisfies the requirement opencv-python==3.4.8.29 (from impresso-pycommons) (from versions: 3.4.0.14, 3.4.10.37, 3.4.11.41, 3.4.11.43, 3.4.11.45, 3.4.13.47, 3.4.15.55, 3.4.16.57, 3.4.16.59, 3.4.17.63, 3.4.18.65, 4.3.0.38, 4.4.0.40, 4.4.0.46, 4.5.1.48, 4.5.3.56, 4.5.4.60, 4.5.5.64, 4.6.0.66, 4.7.0.72, 4.8.0.74, 4.8.0.76, 4.8.1.78, 4.9.0.80, 4.10.0.82, 4.10.0.84)
ERROR: No matching distribution found for opencv-python==3.4.8.29

Python version:

python -V
Python 3.11.9

My usage:

from impresso_commons.versioning.compute_manifest import create_manifest
ModuleNotFoundError: No module named 'impresso_commons.versioning.compute_manifest'

Maybe the import has changed. I could not find the new location in the documentation.

Thanks

piconti commented 2 months ago

The line from impresso_commons.versioning.compute_manifest import create_manifest works ok on my side (python v3.11.5 and python v3.11.9), and I didn't move the import.

Are you sure that impresso_commons is correctly installed? Locally I have opencv-python v 4.9.0.80 so this might be the problem.

Maybe try installing opencv-python first, and then pycommons?

I do need to change this requirement anyway.

EmanuelaBoros commented 2 months ago
pip install impresso_commons
Requirement already satisfied: impresso_commons in /home/eboros/.local/lib/python3.11/site-packages (1.0.2)
pip freeze | grep opencv-python
opencv-python==4.10.0.84
python
Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from impresso_commons.versioning.compute_manifest import create_manifest
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'impresso_commons.versioning.compute_manifest'
>>> 

This happens on the cluster and on my machine.

EmanuelaBoros commented 2 months ago
ll /home/eboros/.local/lib/python3.11/site-packages/impresso_commons/versioning/
total 20
drwxr-xr-x  3 eboros DHLAB-unit   91 Jul 24 16:02 ./
drwxr-xr-x 11 eboros DHLAB-unit 4096 Jul 24 16:02 ../
drwxr-xr-x  2 eboros DHLAB-unit   94 Jul 24 16:02 __pycache__/
-rw-r--r--  1 eboros DHLAB-unit 7778 Jul 24 16:02 manifest_0.py
-rw-r--r--  1 eboros DHLAB-unit 5718 Jul 24 16:02 rebuilt_manifest_0.py
EmanuelaBoros commented 2 months ago

I see, I guess it does not install correctly. I will git pull and try like this.

piconti commented 2 months ago

The current version is 1.1.0, so yes, it seems something is wrong with the installed package. Does it work if you upgrade?
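
A quick way to check both points from Python (which version is installed, and whether the submodule is actually present in it) is sketched below. This is only a diagnostic sketch using the standard library: the helper name `describe_install` is made up for illustration, and the distribution name `impresso_commons` is taken from the `pip install` output above.

```python
from importlib import metadata, util


def describe_install(dist_name: str, module_name: str) -> dict:
    """Report the installed version of a distribution and whether a module resolves."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # distribution is not installed in this environment

    try:
        spec = util.find_spec(module_name)
    except ModuleNotFoundError:
        spec = None  # the parent package itself could not be imported

    return {
        "version": version,
        "module_found": spec is not None,
        "origin": getattr(spec, "origin", None),  # file the module would load from
    }


if __name__ == "__main__":
    print(
        describe_install(
            "impresso_commons",
            "impresso_commons.versioning.compute_manifest",
        )
    )
```

If `module_found` is False while `version` is set, the installed release predates the module (or the install is incomplete), and the `origin` path shows which site-packages directory is being picked up.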

EmanuelaBoros commented 2 months ago
  1. First try
    
    config_dict = {
    "data_stage": "entities",
    "output_bucket": "42-processed-data-final/entities/entities_v1-0-3",
    "input_bucket": "22-rebuilt-final",
    "git_repository": "../impresso-semantic-enrichment-deployment/",
    "newspapers": ["marieclaire"],
    "temp_directory": "../impresso-semantic-enrichment-deployment/temp",
    "previous_mft_s3_path": "",
    "is_staging": True,
    "is_patch": False,
    "patched_fields": [],
    "push_to_git": False,
    "file_extensions": "jsonl.bz2",
    "log_file": "/local/path/to/log_file.log",
    "notes": """First NER/EL models: 2024-02-01, 
        - NER: hipe2020_model-stacked_release_2024-01-24-mdeberta_num_layers-2_attn_type_adatrans_n_heads-12_head_dims"
    "-128_pos_embed_sin_trans_dropout_0.45_fc_dropout0.4_pool_method_max_layers_0,-1,-2,-3,-4,"
    "-5/best/best_DataParallel_f_2024-01-25-11-58-42-576046 = stacked-2-mdeberta-v3-base
        - EL: mGenre finetuned on (all with Qids) HIPE data
    """,
    }

create_manifest(config_dict)

Output:
```
2024-08-06 13:08:41,137 impresso_commons.versioning.compute_manifest INFO     Validating that the provided configuration has all required arugments.
2024-08-06 13:08:41,137 impresso_commons.versioning.compute_manifest INFO     Provided config validated.
2024-08-06 13:08:41,137 impresso_commons.versioning.compute_manifest INFO     Starting to generate the manifest for DataStage: 'entities'
2024-08-06 13:08:41,137 impresso_commons.versioning.compute_manifest INFO     Fetching the files to consider for titles ['marieclaire']...
2024-08-06 13:08:41,997 impresso_commons.versioning.compute_manifest INFO     Collected a total of 1 files, reading them...
2024-08-06 13:08:41,998 impresso_commons.versioning.compute_manifest INFO     Files identified successfully, initialising the manifest.
2024-08-06 13:08:42,219 impresso_commons.versioning.data_manifest INFO     DataManifest for entities stage successfully initialized.
2024-08-06 13:08:42,219 impresso_commons.versioning.compute_manifest INFO     ---------- marieclaire ----------
2024-08-06 13:08:42,289 impresso_commons.versioning.compute_manifest INFO     marieclaire - Starting to compute the statistics on the fetched files...
Traceback (most recent call last):
  File "/home/eboros/.local/lib/python3.11/site-packages/dask_expr/_core.py", line 467, in __getattr__
    return object.__getattribute__(self, key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eboros/.conda/envs/myenv/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/eboros/.local/lib/python3.11/site-packages/dask_expr/_expr.py", line 496, in _meta
    return self.operation(*args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eboros/.local/lib/python3.11/site-packages/dask/utils.py", line 1241, in __call__
    return getattr(__obj, self.method)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eboros/.local/lib/python3.11/site-packages/pandas/core/frame.py", line 9846, in explode
    result = df[columns[0]].explode()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eboros/.local/lib/python3.11/site-packages/pandas/core/series.py", line 4550, in explode
    values, counts = self._values._explode()
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eboros/.local/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py", line 1788, in _explode
    if not pa.types.is_list(self.dtype.pyarrow_dtype):
                            ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'StringDtype' object has no attribute 'pyarrow_dtype'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eboros/data/data/eboros-data/projects/impresso-semantic-enrichment-deployment/generate_manifest.py", line 88, in <module>
    create_manifest(config_dict)
  File "/home/eboros/.local/lib/python3.11/site-packages/impresso_commons/versioning/compute_manifest.py", line 271, in create_manifest
    computed_stats = compute_stats_for_stage(processed_files, stage, client)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eboros/.local/lib/python3.11/site-packages/impresso_commons/versioning/compute_manifest.py", line 152, in compute_stats_for_stage
    return compute_stats_in_entities_bag(files_bag, client=client)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eboros/.local/lib/python3.11/site-packages/impresso_commons/versioning/helpers.py", line 944, in compute_stats_in_entities_bag
    .explode("ne_entities")
     ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eboros/.local/lib/python3.11/site-packages/dask_expr/_collection.py", line 3246, in explode
    return new_collection(expr.ExplodeFrame(self, column=column))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eboros/.local/lib/python3.11/site-packages/dask_expr/_collection.py", line 4764, in new_collection
    meta = expr._meta
           ^^^^^^^^^^
  File "/home/eboros/.local/lib/python3.11/site-packages/dask_expr/_core.py", line 472, in __getattr__
    raise RuntimeError(
RuntimeError: Failed to generate metadata for ExplodeFrame(frame=FromGraph(749d18c), column=['ne_entities']). This operation may not be supported by the current backend.
```

  2. Second try (though I probably didn't understand how to use this one)

manifest = DataManifest(
    data_stage="entities",  # DataStage.PASSIM also accepted
    s3_output_bucket="42-processed-data-final/entities/entities_v1-0-3",  # includes partition within bucket
    s3_input_bucket="22-rebuilt-final",  # includes partition within bucket
    git_repo="../impresso-semantic-enrichment-deployment",
    temp_dir="../impresso-semantic-enrichment-deployment/temp",
    staging=True,  # If True, will be pushed to 'staging' branch of impresso-data-release, else 'master'
    is_patch=False,
    previous_mft_path=None,  # a manifest already exists on S3 inside "32-passim-rebuilt-final/passim"
    notes="""First NER/EL models: 2024-02-01, 
        - NER: hipe2020_model-stacked_release_2024-01-24-mdeberta_num_layers-2_attn_type_adatrans_n_heads-12_head_dims"
    "-128_pos_embed_sin_trans_dropout_0.45_fc_dropout0.4_pool_method_max_layers_0,-1,-2,-3,-4,"
    "-5/best/best_DataParallel_f_2024-01-25-11-58-42-576046 = stacked-2-mdeberta-v3-base
        - EL: mGenre finetuned on (all with Qids) HIPE data
    """,
)

This didn't do anything.

EmanuelaBoros commented 2 months ago

Would it crash if: {"id":"marieclaire-1944-01-01-a-i0020","ts":"2024-08-02T12:42:13Z","sys_id":"stacked-2-mdeberta-v3-base|mgenre","nes":[]}?

piconti commented 2 months ago

Ok I see, thank you.

It looks like the .explode("ne_entities") call here is not working for some reason.

Would it crash if: {"id":"marieclaire-1944-01-01-a-i0020","ts":"2024-08-02T12:42:13Z","sys_id":"stacked-2-mdeberta-v3-base|mgenre","nes":[]}?

I'd have to check more precisely, but yes, that's entirely possible. I'm not sure, though:

"ne_entities": sorted(
                    list(set([m["wkd_id"] for m in ci["nes"] if m["wkd_id"] != "NIL"]))
                ),  # sorted list to ensure all are the same

Here the result would just be an empty list I think, but I don't know how "explode" reacts to empty lists.

As for the second try, it's normal that it didn't do anything: this is just the initialization. All the "filling in" (adding the counts, etc.) is done in create_manifest(), but has to be done by hand if you initialize the manifest yourself.

piconti commented 2 months ago

From this, it seems that one can explicitly pass the option collapse_empty=True to make rows corresponding to empty lists disappear, but I don't think that means an exception would be raised otherwise.

The doc says it would return NaN otherwise, which is probably what caused the exception. I'll try it out with some examples and update you.
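
For reference, here is a minimal pandas-only sketch of how explode treats empty lists. It does not go through the dask code path used by create_manifest, so it only illustrates the documented pandas behaviour: a row holding an empty list is kept, with NaN in the exploded column. The column names mirror the ones in this thread, and the values are made up for illustration.

```python
import pandas as pd

# Three toy content items: two with entity IDs, one with an empty list.
df = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "ne_entities": [["Q1", "Q2"], [], ["Q3"]],
    }
)

# One output row per list element; the empty list becomes a single NaN row.
exploded = df.explode("ne_entities")
print(exploded)

# Dropping the NaN rows afterwards removes the items that had no entities.
cleaned = exploded.dropna(subset=["ne_entities"])
print(cleaned)
```

In plain pandas this does not raise, so the NaN row on its own may not be what triggers the error seen with dask.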

EmanuelaBoros commented 2 months ago
  1. I just removed ne_entities and it worked a bit more, but crashed in something else.

  2. I think it was unclear for me, can this be added to the Readme along with the import for DataManifest?

EmanuelaBoros commented 2 months ago

collapse_empty=True doesn't seem to be implemented yet.

I'll try some things on my side and I'll let you know.

EmanuelaBoros commented 2 months ago

It worked with:

    def extract_ne_entities(ci):
        nes = ci.get("nes", [])
        if not isinstance(nes, list):
            nes = []
        ne_entities = sorted(
            #list(set(m["wkd_id"] for m in nes if "wkd_id" in m and m["wkd_id"] != "NIL"))
            list(set(m["wkd_id"] for m in nes if "wkd_id" in m and m["wkd_id"] not in ["NIL", None]))
        )
        return ne_entities

    count_df = (
        s3_entities.map(
            lambda ci: {
                "np_id": ci["id"].split("-")[0],
                "year": ci["id"].split("-")[1],
                "issues": "-".join(ci["id"].split("-")[:-1]),
                "content_items_out": 1,
                "ne_mentions": len(ci["nes"]),
                "ne_entities": extract_ne_entities(ci),  # sorted list of unique wkd_id values
            }
        )
        .to_dataframe(
            meta={
                "np_id": str,
                "year": str,
                "issues": str,
                "content_items_out": int,
                "ne_mentions": int,
                "ne_entities": object,
            }
        )
    )

    count_df['ne_entities'] = count_df['ne_entities'].apply(lambda x: x if isinstance(x, list) else [x])
    count_df = count_df.explode("ne_entities").persist()

Also, the file that calls compute needs to have the guard if __name__ == "__main__":
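
A sketch of what such a guarded script could look like is below. The guard is presumably needed because the dask client may start worker processes that re-import the calling script, but that reason is an assumption, not something confirmed in this thread. The `CONFIG` dict is abbreviated (the remaining keys are those of the "First try" config above), and the try/except around the import is only there so the sketch can be run without the package installed.

```python
# Abbreviated config: only two of the keys from the "First try" above are shown.
CONFIG = {
    "data_stage": "entities",
    "newspapers": ["marieclaire"],
    # ... remaining keys as in the "First try" config_dict
}


def main() -> None:
    """Run the manifest computation; everything with side effects lives here."""
    try:
        from impresso_commons.versioning.compute_manifest import create_manifest
    except ImportError:
        print("impresso_commons is not installed; skipping create_manifest")
        return
    create_manifest(CONFIG)


# Nothing else is needed after the guard: it only ensures main() runs when the
# file is executed as a script, and not when the module is imported again.
if __name__ == "__main__":
    main()
```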

piconti commented 2 months ago
  1. I just removed ne_entities and it worked a bit more, but crashed in something else

Removed from where? I'm not sure I understand.

I think it was unclear for me, can this be added to the Readme along with the import for DataManifest?

I'll add the line used to import DataManifest, that's a good idea, but I think the difference is explained in the README.
The section "Computing a manifest - compute_manifest.py script" describes the use of the script (where create_manifest is called), and the section "Computing a manifest on the fly during a process" describes the instantiation and filling of the manifest. The thing is that you're doing something "in between": you're using the script version (which computes everything itself), but calling create_manifest directly from your code. I can add a note that this option is also possible, but I didn't want to make it even more confusing.

collapse_empty=True doesn't seem to be implemented yet.

You're right -.- if only!

Oh nice that you found a solution! I think the key was probably adding None to the filtered values. However, are there cases where nes is not a list?

Also, the file that calls compute needs to have the guard if __name__ == "__main__":

What do you mean? That the script calling create_manifest also needs to have if __name__ == "__main__":? And what should come after it, nothing?

piconti commented 2 months ago

I think it works by changing only this:

"ne_entities": sorted(
                    list(
                        set(
                            [
                                m["wkd_id"]
                                for m in ci["nes"]
                                if "wkd_id" in m and m["wkd_id"] not in ["NIL", None]
                            ]
                        )
                    )
                ),  # sorted list to ensure all are the same

I'll try on slightly more data in two secs.
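
To see what the extra conditions change, here is a small standalone sketch in plain Python (no dask). The helper name `unique_entity_ids` and the second sample item are hypothetical, and whether None values for wkd_id actually occur in the data is an assumption. The first sample is the item quoted earlier in this thread.

```python
def unique_entity_ids(content_item: dict) -> list:
    """Sorted unique wkd_id values, skipping "NIL", None and missing keys."""
    nes = content_item.get("nes") or []
    return sorted(
        {m["wkd_id"] for m in nes if m.get("wkd_id") not in ("NIL", None)}
    )


samples = [
    # Item from this thread: no mentions at all.
    {"id": "marieclaire-1944-01-01-a-i0020", "nes": []},
    # Hypothetical item mixing linked, unlinked (NIL) and missing (None) IDs.
    {
        "id": "marieclaire-1944-01-01-a-i0021",
        "nes": [
            {"wkd_id": "Q2"},
            {"wkd_id": "NIL"},
            {"wkd_id": None},
            {"wkd_id": "Q1"},
            {"wkd_id": "Q2"},
        ],
    },
]

for item in samples:
    print(item["id"], unique_entity_ids(item))

# Without the None filter, sorting a mix of str and None fails in Python 3.
try:
    sorted({"Q1", None})
except TypeError as err:
    print("unfiltered sort fails:", err)
```

So if a None ever appears next to a real ID, the original `!= "NIL"` filter alone would make `sorted` raise, which would fit the observation that adding None to the filter helps.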

EmanuelaBoros commented 2 months ago

For me, it only works with:

    def extract_ne_entities(ci):
        nes = ci.get("nes", [])
        if not isinstance(nes, list):
            nes = []
        ne_entities = sorted(
            list(set(m["wkd_id"] for m in nes if "wkd_id" in m and m["wkd_id"] not in ["NIL", None]))
        )
        return ne_entities

    count_df = (
        s3_entities.map(
            lambda ci: {
                "np_id": ci["id"].split("-")[0],
                "year": ci["id"].split("-")[1],
                "issues": "-".join(ci["id"].split("-")[:-1]),
                "content_items_out": 1,
                "ne_mentions": len(ci["nes"]),
                #"ne_entities": sorted(
                #     list(set(m["wkd_id"] for m in ci.get("nes", []) if m["wkd_id"] != "NIL"))
                # ),  # sorted list to ensure all are the same
                "ne_entities": extract_ne_entities(ci) # sorted(
                #    list(set(m["wkd_id"] for m in ci.get("nes", []) if m["wkd_id"] != "NIL"))
                #) if ci.get("nes") else []  # sorted list to ensure all are the same
                #"ne_entities": sorted(
                #    list(set([m["wkd_id"] for m in ci["nes"] if m["wkd_id"] != "NIL"]))
                #),  # sorted list to ensure all are the same
            }
        )
        .to_dataframe(
            meta={
                "np_id": str,
                "year": str,
                "issues": str,
                "content_items_out": int,
                "ne_mentions": int,
                "ne_entities": object,
            }
        )
        #.explode("ne_entities")
        #.persist()
    )

    # it works only with this check of []
    count_df['ne_entities'] = count_df['ne_entities'].apply(lambda x: x if isinstance(x, list) else [x])
    count_df = count_df.explode("ne_entities").persist()

It does not work without: count_df['ne_entities'] = count_df['ne_entities'].apply(lambda x: x if isinstance(x, list) else [x])