Tsardoz opened this issue 1 year ago
Yes, I am using the same data.
Thanks for your reply. FYI, randomly shuffling time-series data into train/validation sets is invalid. I understand the original paper you based this on made the same mistake. With 800 ms overlap between records, almost every validation sample has at least one record in the training set that is 80% identical. Splits should be done on a per-subject basis. I am notifying the original authors and the journal as well. Thank you for making this code public; hopefully this will help others in the field.
The data I downloaded is in .csv files, and I cannot make sense of the values in them (they look like random numbers rather than amplitude or phase values).
@Tsardoz thanks for sharing your suggestions with me.
@du7092 As I remember, the first half of each line in the csv files is amplitude information and the second half is phase information.
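If that layout is right, a row can be split in two as below. This is only a sketch: the number of values per row (and what each column maps to, e.g. antennas × subcarriers) is an assumption you should check against your own files.

```python
import numpy as np

def split_amp_phase(row):
    """Split one CSI record into its amplitude and phase halves.

    Assumes the row holds 2*N values: the first N are amplitudes and
    the last N are phases, per the comment above. N itself depends on
    the capture setup, so verify it against your csv files.
    """
    values = np.asarray(row, dtype=float)
    if values.size % 2 != 0:
        raise ValueError("expected an even number of values per row")
    n = values.size // 2
    return values[:n], values[n:]

# Toy example with a 6-value row (N = 3):
amp, phase = split_amp_phase([1.0, 2.0, 3.0, 0.1, 0.2, 0.3])
print(amp, phase)  # [1. 2. 3.] [0.1 0.2 0.3]
```

In practice you would read each row with Python's `csv` module (or `np.loadtxt`) and pass it to this function.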
@ludlows I am having trouble opening the dataset. As you can see below, when I open the dataset there are no interpretable values present. Can you please help me open the dataset and explain what the row and column values are?
That means the original data was extracted in parallel from 90 antennas, right?
When sliding the window to extract data, the window length and step length should be set the same to ensure that the extracted data does not overlap. Should data with no activity be classified separately for training to help the model understand background noise and data characteristics in a normal state?
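The non-overlapping extraction suggested above can be sketched like this. The shapes are illustrative only (this is not the repo's actual loader); setting `step` equal to `window_len` gives disjoint windows, while a smaller `step` reproduces the overlapping extraction being criticised.

```python
import numpy as np

def make_windows(samples, window_len, step):
    """Cut a [T, F] time series into [num_windows, window_len, F] chunks.

    With step == window_len the windows do not overlap; with step <
    window_len consecutive windows share (window_len - step) timesteps.
    """
    starts = range(0, len(samples) - window_len + 1, step)
    return np.stack([samples[s:s + window_len] for s in starts])

# 10 timesteps, 4 features; window of 5 with step 5 -> 2 disjoint windows
x = np.arange(40).reshape(10, 4)
windows = make_windows(x, window_len=5, step=5)
print(windows.shape)  # (2, 5, 4)
```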
No. The data is still not independent even if windows do not overlap. You just cannot do this. Ideally the data should be split so that no subject (person) overlaps training/test/validation sets as the data is not independent then either. Read any text on this.
Thank you for sharing. So the best way to ensure the independence of the dataset is to split based on subjects? By that rule, it is evident that this method of data splitting leads to model overfitting, which is why the test accuracy is so high. Do you have any recommended open-source datasets? I find it a bit challenging to find complete datasets on GitHub.
Subject-wise splitting is definitely the best way. In university studies (i.e. most published ones) there are only a few subjects, so this usually leads to really poor results (which is probably why nobody does it). You can also keep the time-series nature intact and put, say, the first half of each experiment into training, then split the remainder into test and validation. This is still not ideal, as the data is then not independent either, but it is far better than randomising everything, which is cheating (unintentionally or otherwise).

Very few datasets are available on the internet. This was the only one I could find, but I stopped looking soon after. Honestly, I think this whole field is vaporware, like cold fusion: a lot of papers published about nothing. If there were anything in it, we would have seen commercial devices by now. Espressif have a demo showing you can detect movement in a room, and I think that will be about the extent of it. Many technologies can do that, though: https://www.hackster.io/news/espressif-shows-off-sensorless-esp-wifi-csi-radar-human-occupancy-activity-solution-909bf970a8e6
If you are cynical (like me) you might question why there are no publicly available datasets and software. And no systems you can buy.
I wrote a Medium article: @.***/researchers-misrepresenting-the-capability-of-human-pose-estimation-from-wifi-channel-strength-4ec4d2f871a4?sk=8904bfff93502326db6af6b632bfe8c7
My suggestion would be to look at another topic if you want to do anything meaningful.
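For what it's worth, the subject-wise split recommended above can be sketched as follows. The subject labels are hypothetical: how subjects are encoded depends entirely on your dataset, so this only shows the mechanics of holding subjects out wholesale.

```python
import numpy as np

def subject_wise_split(subject_ids, test_subjects, val_subjects):
    """Build boolean index masks for a subject-wise split.

    subject_ids: the subject label for every window.
    test_subjects / val_subjects: sets of subjects held out entirely;
    every other subject goes to training.
    """
    subject_ids = np.asarray(subject_ids)
    test_mask = np.isin(subject_ids, list(test_subjects))
    val_mask = np.isin(subject_ids, list(val_subjects))
    train_mask = ~(test_mask | val_mask)
    # Sanity check: no subject may appear in more than one split.
    assert not set(subject_ids[train_mask]) & set(subject_ids[test_mask])
    assert not set(subject_ids[train_mask]) & set(subject_ids[val_mask])
    return train_mask, val_mask, test_mask

subjects = ["s1", "s1", "s2", "s2", "s3", "s3"]
train, val, test = subject_wise_split(subjects, {"s3"}, {"s2"})
print(train)  # [ True  True False False False False]
```

scikit-learn's `GroupShuffleSplit` / `GroupKFold` do the same thing with subjects passed as `groups`, if you prefer a library solution.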
Yes, you are right. In fact, activity recognition based on WiFi signals started a long time ago, moving from RSSI to CSI. Unfortunately, most of the models in the papers are only effective on their own test data, and the data in most of the papers is not traceable.
Splitting the data sequence by time and training the same model, my best prediction accuracy is less than 50%, nowhere near the results shown on the homepage. Is my result the same as yours? @Tsardoz
I cannot find the original analysis I did but have copied the results from the confusion matrix in the Medium article.
[image: confusion matrix results]
Since there are 7 different activities, there is a 1-in-7 chance (≈0.14) of guessing the correct one. The other major flaw in the study is the lack of a null class: if this is going to be used anywhere people are not restricted to these activities, the results would be even worse. I have used video for fall detection, and although I get sensitivities/specificities better than 0.95 using YOLO models, that is still not good enough for practical use because of the false positives you get from all sorts of things (e.g. pets, piles of clothes) when cameras are on 24/7. I think this CSI work is interesting, but it is plagued by a lack of transparency. I honestly cannot see it ever being useful in a clinical setting. I don't think YOLO-style models are either, for that matter.
I should emphasise this is not even my code. I just used the code from here and changed the splits to subject-wise. I am a bit surprised it is still there.
https://github.com/ermongroup/Wifi_Activity_Recognition
Is this the same data as was used in the paper "A Survey on Behavior Recognition Using WiFi Channel State Information"? The methods used in that paper were sketchy. Have you used identical methods?