scanoss / engine

SCANOSS Open Source Inventory Engine
GNU General Public License v2.0

[test on V5.3.1] unexpected behaviors when upgrade to V5.3.1 #57

Closed superkaiy closed 7 months ago

superkaiy commented 8 months ago

Hi, guys. I ran some tests on engine V5.3.1.

| Scan file | KB | Comment |
| --- | --- | --- |
| 1.py | file1 | not modified, just renamed |
| 2.py | file2 | partial content of file2 |
| 3.py | file1, file2 | mix of file1 and file2 |

With V5.3.0, the results are as expected above, but with V5.3.1 they are not: the scans of 2.c and 3.c match nothing.

Attachments: output_5.3.0.txt, output_5.3.1.txt, test_case.zip

mscasso-scanoss commented 7 months ago

Hi @superkaiy, thank you for reporting this issue. Attached, you will find the response from our servers. Unfortunately, we are using different knowledge bases, so I'm unable to reproduce your issue. However, I will be happy to assist you in identifying the root of this problem and confirming whether there is a bug or not.

Please run the 'scanoss' command for each file, adding "-q" at the end, to obtain the debug information, and share it with me. Please perform this for both v5.3.0 and v5.3.1. I will be eagerly awaiting your response to proceed.

Best regards, Mariano

Attachment: test_output.txt

superkaiy commented 7 months ago

Hi @mscasso-scanoss, my knowledge base contains the following two components:

  1. repo: https://github.com/dpkp/kafka-python.git; branch: master; revision: 5bb126bf20bbb5baeb4e9afc48008dbe411631bc (Screenshot from 2023-09-25 20-22-19)

  2. repo: https://github.com/lencx/ChatGPT.git; branch: main; revision: de5c8f0f8770c0e836d808bede4ac50427611ff5 (Screenshot from 2023-09-25 20-19-25)

Debug information: debug_infor_v5.3.0.txt, debug_infor_v5.3.1.txt

mscasso-scanoss commented 7 months ago

Hi! Sorry for the delay. Please test the latest version and close the issue if you think the problem is solved. Best regards, Mariano

superkaiy commented 7 months ago

Hello @mscasso-scanoss, I tested the latest version, and the match percentage is only ~25% even when the file has been modified very slightly.

In this scenario, the match accuracy of the latest version is clearly not as expected, whereas the match accuracy of V5.3.0 is. For the difference between the two versions, see the following change: https://github.com/scanoss/engine/blob/4a801c981cc15ac319465aafdf1d7b990b4e3d58/src/snippets.c#L673

mscasso-scanoss commented 7 months ago

Hello, @superkaiy, thank you so much for your feedback. I will keep it in mind for the next release. However, it's important to note that the engine, based on the winnowing algorithm, isn't designed to produce precise line ranges. The matching percentage is also an approximation and may not always accurately reflect reality. The snippet matching concept aims to assist users in identifying possible matches and snippets, but confirming the exact range requires manual validation or additional analysis, such as HPSM.
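To illustrate why a winnowing-based match percentage is only an approximation, here is a minimal sketch of the general winnowing technique in Python. It is not the engine's code from src/snippets.c: the gram and window sizes, the use of CRC32, the alphanumeric-only normalization, and the `normalize`, `winnow` and `match_percent` helpers are all assumptions chosen for illustration.

```python
import zlib


def normalize(text):
    """Keep only alphanumerics, lowercased, remembering each character's line."""
    chars, lines = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch.isalnum():
                chars.append(ch.lower())
                lines.append(line_no)
    return "".join(chars), lines


def winnow(text, gram=8, window=4):
    """Return {fingerprint_hash: line_number} for the hashes picked by winnowing.

    gram and window are illustrative values, not the engine's settings.
    """
    chars, lines = normalize(text)
    # Hash every k-gram of the normalized text.
    hashes = [
        zlib.crc32(chars[i:i + gram].encode())
        for i in range(len(chars) - gram + 1)
    ]
    selected = {}
    # Slide a window over the hashes and keep the rightmost minimum of each.
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        smallest = min(win)
        idx = start + (window - 1 - win[::-1].index(smallest))
        # Record the line where the selected k-gram starts (first occurrence).
        selected.setdefault(smallest, lines[idx])
    return selected


def match_percent(scanned, known):
    """Share of the scanned file's fingerprints that also occur in the known file."""
    fp_scanned = winnow(scanned)
    fp_known = winnow(known)
    if not fp_scanned:
        return 0.0
    shared = set(fp_scanned) & set(fp_known)
    return 100.0 * len(shared) / len(fp_scanned)
```

Only a sparse sample of hashes survives the windowing step, so a small edit can remove several fingerprints at once, and line ranges can only be inferred from the few lines where the selected k-grams happen to start. In a sketch like this, both the percentage and the ranges are estimates that depend on the chosen gram and window sizes.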

I understand that you obtained more accurate ranges with the previous engine version in the cases you tested. However, the latest version is actually yielding better results in our extensive test dataset. Please feel free to open a pull request with your change over the current state of main branch, and I will test it.

Additionally, we are currently hosting a workshop in Madrid to discuss match accuracy. If you'd like to participate in a virtual meeting, please send me an email at mariano.scasso@scanoss.com. Your presence would be highly valuable to us.

superkaiy commented 7 months ago

Hello @mscasso-scanoss, thanks for your patience. As you said, snippet matching is only an aid, and the difference in results may be due to different test datasets. HPSM may be able to meet requirements that need exact ranges, since more data will be generated during the scan.