The line `from impresso_commons.versioning.compute_manifest import create_manifest` works fine on my side (Python 3.11.5 and 3.11.9), and I didn't move the import.
Are you sure that impresso_commons is correctly installed?
Locally I have opencv-python 4.9.0.80, so this might be the problem. Maybe try installing opencv-python first and then pycommons? I indeed need to change this requirement anyway.
```
$ pip install impresso_commons
Requirement already satisfied: impresso_commons in /home/eboros/.local/lib/python3.11/site-packages (1.0.2)
$ pip freeze | grep opencv-python
opencv-python==4.10.0.84
$ python
Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from impresso_commons.versioning.compute_manifest import create_manifest
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'impresso_commons.versioning.compute_manifest'
>>>
```
This happens on the cluster and on my machine.
```
$ ll /home/eboros/.local/lib/python3.11/site-packages/impresso_commons/versioning/
total 20
drwxr-xr-x  3 eboros DHLAB-unit   91 Jul 24 16:02 ./
drwxr-xr-x 11 eboros DHLAB-unit 4096 Jul 24 16:02 ../
drwxr-xr-x  2 eboros DHLAB-unit   94 Jul 24 16:02 __pycache__/
-rw-r--r--  1 eboros DHLAB-unit 7778 Jul 24 16:02 manifest_0.py
-rw-r--r--  1 eboros DHLAB-unit 5718 Jul 24 16:02 rebuilt_manifest_0.py
```
I see, I guess it does not install correctly. I will git pull and try it that way.
The current version is 1.1.0, so yes, something seems wrong with the installed package. Does it work if you upgrade (e.g. `pip install --upgrade impresso_commons`)?
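A quick way to double-check which copy and version the interpreter actually picks up (a minimal standalone sketch, not from the thread):

```python
import impresso_commons
from importlib.metadata import version

# Where the package is imported from, and which version is installed;
# the 1.0.2 install listed above has no compute_manifest.py in versioning/.
print(impresso_commons.__file__)
print(version("impresso_commons"))
```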
```python
config_dict = {
    "data_stage": "entities",
    "output_bucket": "42-processed-data-final/entities/entities_v1-0-3",
    "input_bucket": "22-rebuilt-final",
    "git_repository": "../impresso-semantic-enrichment-deployment/",
    "newspapers": ["marieclaire"],
    "temp_directory": "../impresso-semantic-enrichment-deployment/temp",
    "previous_mft_s3_path": "",
    "is_staging": True,
    "is_patch": False,
    "patched_fields": [],
    "push_to_git": False,
    "file_extensions": "jsonl.bz2",
    "log_file": "/local/path/to/log_file.log",
    "notes": """First NER/EL models: 2024-02-01,
- NER: hipe2020_model-stacked_release_2024-01-24-mdeberta_num_layers-2_attn_type_adatrans_n_heads-12_head_dims"
"-128_pos_embed_sin_trans_dropout_0.45_fc_dropout0.4_pool_method_max_layers_0,-1,-2,-3,-4,"
"-5/best/best_DataParallel_f_2024-01-25-11-58-42-576046 = stacked-2-mdeberta-v3-base
- EL: mGenre finetuned on (all with Qids) HIPE data
""",
}

create_manifest(config_dict)
```
Output:
```
2024-08-06 13:08:41,137 impresso_commons.versioning.compute_manifest INFO Validating that the provided configuration has all required arugments.
2024-08-06 13:08:41,137 impresso_commons.versioning.compute_manifest INFO Provided config validated.
2024-08-06 13:08:41,137 impresso_commons.versioning.compute_manifest INFO Starting to generate the manifest for DataStage: 'entities'
2024-08-06 13:08:41,137 impresso_commons.versioning.compute_manifest INFO Fetching the files to consider for titles ['marieclaire']...
2024-08-06 13:08:41,997 impresso_commons.versioning.compute_manifest INFO Collected a total of 1 files, reading them...
2024-08-06 13:08:41,998 impresso_commons.versioning.compute_manifest INFO Files identified successfully, initialising the manifest.
2024-08-06 13:08:42,219 impresso_commons.versioning.data_manifest INFO DataManifest for entities stage successfully initialized.
2024-08-06 13:08:42,219 impresso_commons.versioning.compute_manifest INFO ---------- marieclaire ----------
2024-08-06 13:08:42,289 impresso_commons.versioning.compute_manifest INFO marieclaire - Starting to compute the statistics on the fetched files...
Traceback (most recent call last):
File "/home/eboros/.local/lib/python3.11/site-packages/dask_expr/_core.py", line 467, in __getattr__
return object.__getattribute__(self, key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/eboros/.conda/envs/myenv/lib/python3.11/functools.py", line 1001, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/home/eboros/.local/lib/python3.11/site-packages/dask_expr/_expr.py", line 496, in _meta
return self.operation(*args, **self._kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/eboros/.local/lib/python3.11/site-packages/dask/utils.py", line 1241, in __call__
return getattr(__obj, self.method)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/eboros/.local/lib/python3.11/site-packages/pandas/core/frame.py", line 9846, in explode
result = df[columns[0]].explode()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/eboros/.local/lib/python3.11/site-packages/pandas/core/series.py", line 4550, in explode
values, counts = self._values._explode()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/eboros/.local/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py", line 1788, in _explode
if not pa.types.is_list(self.dtype.pyarrow_dtype):
^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'StringDtype' object has no attribute 'pyarrow_dtype'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/eboros/data/data/eboros-data/projects/impresso-semantic-enrichment-deployment/generate_manifest.py", line 88, in <module>
create_manifest(config_dict)
File "/home/eboros/.local/lib/python3.11/site-packages/impresso_commons/versioning/compute_manifest.py", line 271, in create_manifest
computed_stats = compute_stats_for_stage(processed_files, stage, client)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/eboros/.local/lib/python3.11/site-packages/impresso_commons/versioning/compute_manifest.py", line 152, in compute_stats_for_stage
return compute_stats_in_entities_bag(files_bag, client=client)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/eboros/.local/lib/python3.11/site-packages/impresso_commons/versioning/helpers.py", line 944, in compute_stats_in_entities_bag
.explode("ne_entities")
^^^^^^^^^^^^^^^^^^^^^^
File "/home/eboros/.local/lib/python3.11/site-packages/dask_expr/_collection.py", line 3246, in explode
return new_collection(expr.ExplodeFrame(self, column=column))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/eboros/.local/lib/python3.11/site-packages/dask_expr/_collection.py", line 4764, in new_collection
meta = expr._meta
^^^^^^^^^^
File "/home/eboros/.local/lib/python3.11/site-packages/dask_expr/_core.py", line 472, in __getattr__
raise RuntimeError(
RuntimeError: Failed to generate metadata for ExplodeFrame(frame=FromGraph(749d18c), column=['ne_entities']). This operation may not be supported by the current backend.
```
```python
manifest = DataManifest(
    data_stage="entities",  # DataStage.PASSIM also accepted
    s3_output_bucket="42-processed-data-final/entities/entities_v1-0-3",  # includes partition within bucket
    s3_input_bucket="22-rebuilt-final",  # includes partition within bucket
    git_repo="../impresso-semantic-enrichment-deployment",
    temp_dir="../impresso-semantic-enrichment-deployment/temp",
    staging=True,  # If True, will be pushed to 'staging' branch of impresso-data-release, else 'master'
    is_patch=False,
    previous_mft_path=None,  # a manifest already exists on S3 inside "32-passim-rebuilt-final/passim"
    notes="""First NER/EL models: 2024-02-01,
- NER: hipe2020_model-stacked_release_2024-01-24-mdeberta_num_layers-2_attn_type_adatrans_n_heads-12_head_dims"
"-128_pos_embed_sin_trans_dropout_0.45_fc_dropout0.4_pool_method_max_layers_0,-1,-2,-3,-4,"
"-5/best/best_DataParallel_f_2024-01-25-11-58-42-576046 = stacked-2-mdeberta-v3-base
- EL: mGenre finetuned on (all with Qids) HIPE data
""",
)
```
This didn't do anything.
Would it crash if: `{"id":"marieclaire-1944-01-01-a-i0020","ts":"2024-08-02T12:42:13Z","sys_id":"stacked-2-mdeberta-v3-base|mgenre","nes":[]}`?
Ok I see, thank you.
It looks like the `.explode("ne_entities")` call here is not working for some reason.
> Would it crash if: `{"id":"marieclaire-1944-01-01-a-i0020","ts":"2024-08-02T12:42:13Z","sys_id":"stacked-2-mdeberta-v3-base|mgenre","nes":[]}`?
I'd have to check more precisely, but yes, it's entirely possible; I'm not sure:

```python
"ne_entities": sorted(
    list(set([m["wkd_id"] for m in ci["nes"] if m["wkd_id"] != "NIL"]))
),  # sorted list to ensure all are the same
```
Here the result would just be an empty list I think, but I don't know how "explode" reacts to empty lists.
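For reference, plain pandas `explode()` keeps the row for an empty list but fills it with NaN (a quick standalone check, unrelated to impresso_commons):

```python
import pandas as pd

# explode() on an empty list does not raise by itself: the row is kept,
# with NaN in place of the missing elements.
s = pd.Series([["Q42", "Q1"], []])
print(s.explode())
# 0    Q42
# 0     Q1
# 1    NaN
# dtype: object
```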
As for the second try, it's normal that it didn't do anything: this is just the initialization. All the "filling in" (adding the counts etc.) is done in `create_manifest()`, but it has to be done by hand if you initialize the manifest yourself.
From this, it seems one can add the option `collapse_empty=True` so that rows corresponding to empty lists disappear, but I don't think that means an exception would be raised otherwise. The doc says it returns NaN otherwise, which is probably what caused the exception. I'll try it out with some examples and update you.
I just removed `ne_entities` and it got a bit further, but then crashed on something else.
I think it was unclear to me; can this be added to the README along with the import for `DataManifest`?
`collapse_empty=True` doesn't seem to be implemented yet.
I'll try some things on my side and I'll let you know.
It worked with:
```python
def extract_ne_entities(ci):
    nes = ci.get("nes", [])
    if not isinstance(nes, list):
        nes = []
    ne_entities = sorted(
        # list(set(m["wkd_id"] for m in nes if "wkd_id" in m and m["wkd_id"] != "NIL"))
        list(set(m["wkd_id"] for m in nes if "wkd_id" in m and m["wkd_id"] not in ["NIL", None]))
    )
    return ne_entities


count_df = (
    s3_entities.map(
        lambda ci: {
            "np_id": ci["id"].split("-")[0],
            "year": ci["id"].split("-")[1],
            "issues": "-".join(ci["id"].split("-")[:-1]),
            "content_items_out": 1,
            "ne_mentions": len(ci["nes"]),
            "ne_entities": extract_ne_entities(ci),
        }
    )
    .to_dataframe(
        meta={
            "np_id": str,
            "year": str,
            "issues": str,
            "content_items_out": int,
            "ne_mentions": int,
            "ne_entities": object,
        }
    )
)
count_df['ne_entities'] = count_df['ne_entities'].apply(lambda x: x if isinstance(x, list) else [x])
count_df = count_df.explode("ne_entities").persist()
```
Also, the file that calls `compute` needs to have the guard `if __name__ == "__main__":`.
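For context, a minimal sketch of that layout (the script structure is hypothetical; only `create_manifest` and the configuration come from above):

```python
from impresso_commons.versioning.compute_manifest import create_manifest

config_dict = {
    "data_stage": "entities",
    # ... same keys as in the configuration shown earlier
}

if __name__ == "__main__":
    # Guard the entry point so that dask worker processes re-importing this
    # module do not re-trigger the manifest computation.
    create_manifest(config_dict)
```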
> I just removed `ne_entities` and it got a bit further, but then crashed on something else.

Removed from where? I'm not sure I understand.
> I think it was unclear to me; can this be added to the README along with the import for `DataManifest`?
I'll add the line used to import `DataManifest`, that's a good idea, but I think the difference is explained in the README.
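Presumably something along these lines (inferring the module path from the log output above; check the README for the exact line):

```python
from impresso_commons.versioning.data_manifest import DataManifest
```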
The section "Computing a manifest - `compute_manifest.py` script" describes the use of the script (where `create_manifest` is called), and the section "Computing a manifest on the fly during a process" describes the instantiation and filling of the manifest.
The thing is that you're doing kind of an "in between": you're using the script version ("self computes") but calling `create_manifest` directly from your own code.
I can add a note that this option is also possible, but I didn't want to make it even more confusing.
> `collapse_empty=True` doesn't seem to be implemented yet.

You're right -.- if only!
Oh nice that you found a solution! I think the secret was maybe the `None` in the filter.
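For illustration (hypothetical data, not necessarily what happened in this run), here is what a stray `None` can do to the original filter:

```python
mentions = [{"wkd_id": "Q42"}, {"wkd_id": "NIL"}, {"wkd_id": None}]

# Original filter: None slips through, and sorting a mix of str and None raises.
try:
    sorted({m["wkd_id"] for m in mentions if m["wkd_id"] != "NIL"})
except TypeError as err:
    print(err)  # e.g. '<' not supported between instances of 'NoneType' and 'str'

# Updated filter: None is dropped together with "NIL".
print(sorted({m["wkd_id"] for m in mentions if "wkd_id" in m and m["wkd_id"] not in ["NIL", None]}))
# ['Q42']
```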
However, are there cases where `nes` is not a list?
> Also, the file that calls `compute` needs to have the guard `if __name__ == "__main__":`.

What do you mean? That the script calling `create_manifest` also needs to have `if __name__ == "__main__":`? What should come after it, nothing?
I think it works by only changing this:

```python
"ne_entities": sorted(
    list(
        set(
            [
                m["wkd_id"]
                for m in ci["nes"]
                if "wkd_id" in m and m["wkd_id"] not in ["NIL", None]
            ]
        )
    )
),  # sorted list to ensure all are the same
```
I'll try on slightly more data in two secs.
For me, it only works with:
```python
def extract_ne_entities(ci):
    nes = ci.get("nes", [])
    if not isinstance(nes, list):
        nes = []
    ne_entities = sorted(
        list(set(m["wkd_id"] for m in nes if "wkd_id" in m and m["wkd_id"] not in ["NIL", None]))
    )
    return ne_entities


count_df = (
    s3_entities.map(
        lambda ci: {
            "np_id": ci["id"].split("-")[0],
            "year": ci["id"].split("-")[1],
            "issues": "-".join(ci["id"].split("-")[:-1]),
            "content_items_out": 1,
            "ne_mentions": len(ci["nes"]),
            # the previous inline "ne_entities": sorted(...) comprehensions are
            # replaced by the helper above
            "ne_entities": extract_ne_entities(ci),
        }
    )
    .to_dataframe(
        meta={
            "np_id": str,
            "year": str,
            "issues": str,
            "content_items_out": int,
            "ne_mentions": int,
            "ne_entities": object,
        }
    )
    # .explode("ne_entities")
    # .persist()
)
# it works only with this check of []
count_df['ne_entities'] = count_df['ne_entities'].apply(lambda x: x if isinstance(x, list) else [x])
count_df = count_df.explode("ne_entities").persist()
```
It does not work without:

```python
count_df['ne_entities'] = count_df['ne_entities'].apply(lambda x: x if isinstance(x, list) else [x])
```
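As an aside, a plain-pandas sketch of what that normalization guarantees before `explode()` runs (standalone, not the dask pipeline):

```python
import pandas as pd

# Wrap any non-list value into a one-element list so the column is uniformly
# list-valued before explode().
col = pd.Series([["Q42"], [], None])
col = col.apply(lambda x: x if isinstance(x, list) else [x])
print(col.tolist())            # [['Q42'], [], [None]]
print(col.explode().tolist())  # ['Q42', nan, None]
```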
Hello,
I am having trouble generating the manifest.
Python version:
My usage:
Maybe the import has changed; I could not find in the documentation where the new location is.
Thanks