stac-utils / stac-index

A service that lists all available and registered STAC catalogs and APIs.
https://stacindex.org
Apache License 2.0
7 stars 4 forks source link

Refactor checkDuplicates function #34

Closed zacharyDez closed 10 months ago

zacharyDez commented 10 months ago

Description

This PR modifies the checkDuplicates function to clean the URL and title strings by removing whitespaces, underscores, and hyphens before calculating the Levenshtein distance. Additionally, the function now only flags exact duplicates (Levenshtein distance = 0).

Steps for Testing:

Preliminary: Insert pystac into Database

  1. Database Setup: Ensure that your local database is set up and modify the connection in common.js to match your database specs.
  2. Insert Sample Data: Insert a sample record (pystac) into your public.ecosystem table:

    INSERT INTO public.ecosystem (
        url, title, summary, categories, email, extensions, api_extensions, created, updated, language
    )
    VALUES (
        'https://github.com/stac-utils/pystac', 'PySTAC', 'Sample Summary', ARRAY['Data Creation', 'Validation'], NULL, ARRAY['eo'], ARRAY[], '2022-01-01T00:00:00.000Z', '2022-01-01T00:00:00.000Z', 'Python'
    );

Step 1: Confirm Successful Insert of pgstac

  1. Run the Server: Navigate to the server directory and execute npm install followed by npm run dev.
  2. Test with CURL: Run the following CURL command to insert the new record pgstac:

    curl -X POST 'http://localhost:9999/add' \
    -H 'Content-Type: application/json' \
    -d '{
        "type": "ecosystem",
        "url": "https://github.com/stac-utils/pgstac",
        "slug": "pgstac",
        "title": "PgSTAC",
        ...
    }'

    This should successfully insert the record without flagging it as a duplicate.

Step 2: Validate Cleaning Function

  1. Test with Modified CURL: Run the following CURL command with modified URL and title:

    curl -X POST 'http://localhost:9999/add' \
    -H 'Content-Type: application/json' \
    -d '{
        "type": "ecosystem",
        "url": "https://github.com/stac-utils/p_g-stac",
        "slug": "p_g-stac",
        "title": "P g-STAC",
        ...
    }'

    The URL and title are intentionally filled with whitespaces, underscores, and hyphens. The function should clean these characters and flag the record as a duplicate, given the existing pgstac entry.

  2. Check for Error: If the cleaning function is working correctly, this CURL command should trigger a duplicate error.

zacharyDez commented 10 months ago

Fixes #33

m-mohr commented 10 months ago

Thanks for the PR, left one comment above.

zacharyDez commented 10 months ago

@m-mohr, let me know if you want any other changes and if I can help with deployment. I'll submit pgstac afterwards.

m-mohr commented 10 months ago

Thanks, please also remove the Levensthein import and dependency, I think it's not used anywhere else. Afterwards this is ready for deployment. I'll take care of deployment, thanks.

zacharyDez commented 10 months ago

Ah, I should've caught that! Just fixed @m-mohr. Thanks!

m-mohr commented 10 months ago

Thanks, you did not yet remove the dependency from the package.json, right?

zacharyDez commented 10 months ago

Still getting used to node development... removed dependency!

m-mohr commented 10 months ago

Thanks, merged. I'll redeploy in the next days and let you know once completed.

m-mohr commented 10 months ago

@zacharyDez I just redeployed STAC Index. You should now be able to submit your missing entries.