spdx / tools-python

A Python library to parse, validate and create SPDX documents.
http://spdx.org
Apache License 2.0
188 stars 134 forks source link

Validation performance extremely poor #818

Open billie-alsup opened 3 months ago

billie-alsup commented 3 months ago

I am finding that validation is extremely slow (taking 3+ hours to validate a document that took a fraction of the time to create). A sample cProfile run shows

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...
        1    0.000    0.000 15303.455 15303.455 document_validator.py:19(validate_full_spdx_document)
   288826   87.452    0.000 15125.126    0.052 spdx_id_validators.py:46(validate_spdx_id)
        1    0.623    0.623 15122.608 15122.608 relationship_validator.py:12(validate_relationships)
   107391    3.714    0.000 15121.905    0.141 relationship_validator.py:22(validate_relationship)
   213108  589.004    0.003 15032.507    0.071 spdx_id_validators.py:25(is_spdx_id_present_in_document)
   213109   88.149    0.000 14443.590    0.068 spdx_id_validators.py:31(get_list_of_all_spdx_ids)
   213110 3420.272    0.016 14003.702    0.066 document_utils.py:11(get_contained_spdx_element_ids)
17340975234 4410.662    0.000 11361.884    0.000 dataclass_with_properties.py:46(get_field)
17343871659 6951.929    0.000 6951.929    0.000 {built-in method builtins.getattr}

The implementation seems to always use sequential scanning over arrays when performing validation, when it might more sense to create an index for a particular array, and use that during validation. Is there any plan to improve performance, at least the validation performance? There the necessary indices for speed improvement could be created dynamically and used during validation. Having such indices available while building the document would be fine as well, although would put some restrictions on how the document is updated. I realize that the arrays are providing a strict ordering that should probably be maintained, so it is not as simple as replacing the arrays with a dictionary.

In the example profile, the function get_list_of_all_spdx_ids is called every time is_spdx_id_present_in_document is invoked. This is a prime example where generating an index at the start of validation and using it during validation would be a win. Perhaps we could add some private members to Document class as the index in this case?

def is_spdx_id_present_in_files(spdx_id: str, files: List[File]) -> bool:
    return spdx_id in [file.spdx_id for file in files]

def is_spdx_id_present_in_document(spdx_id: str, document: Document) -> bool:
    all_spdx_ids_in_document: List[str] = get_list_of_all_spdx_ids(document)

    return spdx_id in all_spdx_ids_in_document

def get_list_of_all_spdx_ids(document: Document) -> List[str]:
    all_spdx_ids_in_document: List[str] = [document.creation_info.spdx_id]
    all_spdx_ids_in_document.extend(get_contained_spdx_element_ids(document))

    return all_spdx_ids_in_document

def is_external_doc_ref_present_in_document(external_ref_id: str, document: Document) -> bool:
    all_external_doc_ref_ids_in_document = [
        external_doc_ref.document_ref_id for external_doc_ref in document.creation_info.external_document_refs
    ]

    return external_ref_id in all_external_doc_ref_ids_in_document
jspeed-meyers commented 3 months ago

Seems similar to https://github.com/spdx/tools-python/issues/790

armintaenzertng commented 3 months ago

Hi @billie-alsup, thanks for using the spdx-tools extensively enough to notice such issues! It seems you already spent a lot of thought on the matter, so could I interest you in opening or contributing to a PR to fix this?

As @jspeed-meyers pointed out, this has already been raised but not yet fixed. There are two PRs open currently (#792 and #800) that could benefit from your input.

billie-alsup commented 3 months ago

I am currently hot patching the code with new functions that build a couple indices (python dictionaries) at the start of the validation, and use those to check for existence, rather than sequential scans through arrays. I don't know if this is the best approach, as opposed to maintaining the dictionaries automatically, but it seemed like the safest approach. I can look into contributing such changes, but I will need to get formal approval from our company legal team first. I should be able to comment on PRs in advance of approval though.

billie-alsup commented 2 months ago

I can start contributing now. I do not seem to have permissions to assign issues to myself though. I thought I would start with a really simple change to introduce myself to the workflow. #819 should be quite simple for example.

armintaenzertng commented 2 months ago

Hi @billie-alsup, glad to hear that! :) I'd be happy to assign you to any issue you like, but I see you closed #819 already. Feel free to work on issues even without being assigned to them. The probability to clash with somebody else already working on it is rather low (sadly). It's probably good practice still to mention in the issue that you are looking into it.