iDrDex / star-django

New STAR app

STARGEO.org/stats #71

Open iDrDex opened 7 years ago

iDrDex commented 7 years ago

We need the above URL with basic stats in one place.

Two basic sets of stats: 1) Reference, and 2) Generated.

1) Reference stats are counts of everything we already track and keep in sync with GEO: Series, Samples, Platforms, Probes, plus one more that we need to start counting: PMID. I'd like to see a cumulative graphical distribution (maybe across species) like: https://www.ncbi.nlm.nih.gov/core/lw/2.0/html/tileshop_pmc/tileshop_pmc_inline.html?title=Click%20on%20image%20to%20zoom&p=PMC3&id=3531084_gks1193f1p.jpg

2) Generated stats are counts of everything we generate through the STARGEO.org front end: Users, Tags, Annotations, etc. We need cumulative graphs here too.

ir4y commented 7 years ago

@idrdex Would you like to replace the current 2D plots

[screenshot, 2017-06-01 16:17:50]

with something like this http://bl.ocks.org/camio/5087116 ?

ir4y commented 7 years ago

What is PMID @idrdex ? How can we count it?

ir4y commented 7 years ago

The first set of graphs will show general system information. Each graph will have a single line representing the total number of items.

The second set of graphs will show the distribution by species. Each graph will have a button that switches from the single-line view to a multi-line view (one line per species).

The third set of graphs will show the distribution by approval status. Each graph will have a button that switches from the single-line view to a two-line view (the number of approved and rejected items).

The fourth set of graphs will show the distribution by user contribution. There will be one line per user, showing the count of contributed tags.
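
A minimal sketch of the single-line versus per-species switch described above, using only the standard library. The function name `cumulative_lines`, the `(day, species)` event shape, and the sample data are hypothetical and are not taken from the star-django code base.

```python
from collections import Counter, defaultdict


def cumulative_lines(events, by_species=False):
    """Build cumulative count lines from (day, species) events.

    Returns {line_name: [(day, running_count), ...]}: a single "total"
    line by default, or one line per species when by_species is True.
    """
    # Count new items per day, keyed by the line they belong to.
    per_day = defaultdict(Counter)
    for day, species in events:
        per_day[day][species if by_species else "total"] += 1

    # Walk the days in order and accumulate running totals per line.
    lines = defaultdict(list)
    running = Counter()
    for day in sorted(per_day):
        running.update(per_day[day])
        for name in sorted(running):
            lines[name].append((day, running[name]))
    return dict(lines)


# Hypothetical events, not real STARGEO data (ISO dates sort as strings).
events = [
    ("2017-06-01", "human"),
    ("2017-06-01", "mouse"),
    ("2017-06-02", "human"),
]
total = cumulative_lines(events)
split = cumulative_lines(events, by_species=True)
```

The button on each graph would then only need to choose between the `total` and `split` results; the same grouping idea applies to the approved/rejected and per-user views.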

P.S. For annotations (both Series and Sample) there are three possible options to show

ir4y commented 7 years ago

An approved SampleTag is a SampleTag related to a SampleAnnotation with best_cohens_kappa. An approved SerieTag is a SerieTag whose agreed flag is set to True.

Suor commented 7 years ago

Filtering by actuality (not deleted)

SeriesTag: is_active
SampleTag: is_active
SerieValidation: not ignored and not by_incompetent
SampleValidation: # by its serie_validation

Selecting concordant/non-concordant and validated/not-validated

SerieAnnotation: best_cohens_kappa == 1
SampleAnnotation: # by serie_annotation, can't be invalidated separately from it
SeriesTag: agreed is not None
SampleTag: # by its series_tag
SerieValidation: best_kappa == 1
SampleValidation: concordant or serie_validation.best_kappa == 1
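
As a rough illustration, the validation rules listed above can be written as plain predicates. The dataclasses below are hypothetical stand-ins for the real Django models; only the field names (`ignored`, `by_incompetent`, `best_kappa`, `concordant`, `serie_validation`) come from the lists above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SerieValidation:
    # Stand-in for the real model, with only the fields named above.
    ignored: bool = False
    by_incompetent: bool = False
    best_kappa: Optional[float] = None


@dataclass
class SampleValidation:
    serie_validation: SerieValidation
    concordant: bool = False


def serie_validation_is_actual(sv):
    # "SerieValidation: not ignored and not by_incompetent"
    return not sv.ignored and not sv.by_incompetent


def sample_validation_is_actual(v):
    # "SampleValidation: by its serie_validation"
    return serie_validation_is_actual(v.serie_validation)


def serie_validation_is_concordant(sv):
    # "SerieValidation: best_kappa == 1"
    return sv.best_kappa == 1


def sample_validation_is_concordant(v):
    # "SampleValidation: concordant or serie_validation.best_kappa == 1"
    return v.concordant or v.serie_validation.best_kappa == 1


# Hypothetical records for illustration only.
good = SerieValidation(best_kappa=1)
ignored = SerieValidation(ignored=True, best_kappa=0.4)
sample_a = SampleValidation(serie_validation=good)
sample_b = SampleValidation(serie_validation=ignored, concordant=True)
```

In the actual application these checks would more likely be expressed as queryset filters; the predicates only spell out the logic.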

Also, for recreating history SeriesTag has several events:

Suor commented 7 years ago

Also, when the first SeriesTag-SerieValidation or SerieValidation-SerieValidation match appears, everything else in that group becomes invalid.

iDrDex commented 7 years ago

Hi all. Any updates on this issue? I noticed that stargeo.org/stats still shows the user statistics and is only available to superusers. I suggest we rename the current page to stargeo.org/users and make a new stargeo.org/stats page with counts like the ones I mentioned in the initial post for this issue. The stats page must support this claim in the paper that is about to be published:

'To date, over 21,000 PubMed publications have been derived from over 1,000,000 digital samples (see http://STARGEO.org/stats)...'

ir4y commented 7 years ago

Hi @idrdex

When the project started, there was no model to store counters of project items. So the first part of this task was to restore this data. That part of the work is finished.

Now we have all the data needed to create the graphs, and you can see them here. It's a set of very simple graphs that only display raw data from the database.

My current task is to prettify these graphs and the UI. I will group them by type. When it is ready, I will change the URL of this page to stargeo.org/stats and add it to the main menu.

ir4y commented 7 years ago

Hi @idrdex @Suor I have released the first version of the graphics. http://stargeo.org/stats/ What do you think about the result?

iDrDex commented 7 years ago

Awesome. Thx. Can you add a graph for PMID cumulative distribution? Just plot the sum total of unique PMIDs over time. This is most important for the paper.

ir4y commented 7 years ago

What is PMID?

iDrDex commented 7 years ago

PMID is the ID of an associated PubMed publication derived from the Series data. It maps to a given Series. For instance, see https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51808, an example Series with an associated publication referenced by PMID=24981333 (https://www.ncbi.nlm.nih.gov/pubmed/24981333). I know PMID is in the tables somewhere, as I can query it from the JSON that is stored in Postgres. We should probably show the PMID in the search results for every Series returned and link to PubMed, just like GEO does. @Suor may want to chime in.

iDrDex commented 7 years ago

@ir4y @Suor we need a cumulative graphic of unique PMID counts ASAP. This code will generate a stargeo data frame that you can use to plot it:

from itertools import chain
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

import pandas as pd
import requests


def parse_url(url, params):
    # Merge params into the query string of url.
    # http://stackoverflow.com/questions/2506379/add-params-to-given-url-in-python
    url_parts = list(urlparse(url))
    query = dict(parse_qsl(url_parts[4]))
    query.update(params)
    url_parts[4] = urlencode(query)
    return urlunparse(url_parts)


def query_api(url='http://stargeo.org/api/serie_annotations/', limit=1000):
    # Yield every result from the paginated API, following the 'next' links.
    url = parse_url(url, dict(limit=limit))
    while True:
        response = requests.get(url).json()
        for result in response['results']:
            yield result
        if not response['next']:
            break
        url = response['next']


def query_df(url='http://stargeo.org/api/serie_annotations/', attrs=True):
    # Load all API results into a data frame, optionally expanding 'attrs'.
    df = pd.DataFrame(list(query_api(url)))
    if attrs:
        df = expand_attrs(df)
    return df


def expand_attrs(df):
    # Turn the 'attrs' column (one mapping per row) into separate columns.
    if 'attrs' in df:
        attrs = pd.DataFrame([dict(attr) for attr in df.attrs], index=df.index)
        df = df.drop(columns='attrs')
        return df.join(attrs)
    return df


def read_stargeo():
    print("Querying STARGEO.org...", end=" ", flush=True)
    stargeo = (query_df('http://stargeo.org/api/series/')
               .sort_values('samples_count', ascending=False)
               .set_index('id'))
    stargeo.index.name = 'series_id'
    print(len(stargeo.index), "records done!")
    return stargeo


stargeo = read_stargeo()
# A series may list several PMIDs in one string; split them before counting.
items = [ids.split("|\n|") for ids in stargeo.pubmed_id.drop_duplicates().dropna()]
stargeo_pmid = set(chain(*items))
print(len(stargeo), 'records')
print(len(stargeo_pmid), 'distinct PMIDs')
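
The snippet above counts distinct PMIDs at a single point in time. For the cumulative graph, each series also needs a date. A minimal standard-library sketch follows, assuming each record carries some date (for example the day the series was added; which field provides it is an assumption) and that `pubmed_id` uses the same `"|\n|"` separator as above. The function name and the sample data are hypothetical.

```python
from datetime import date


def cumulative_distinct_pmids(records, sep="|\n|"):
    """Return [(day, running_total)] of distinct PMIDs seen up to each day.

    records: iterable of (day, pubmed_id) pairs; pubmed_id may be None or
    several IDs joined by sep.
    """
    seen = set()
    totals = []
    for day, pubmed_id in sorted(records, key=lambda r: r[0]):
        if pubmed_id:
            seen.update(p.strip() for p in pubmed_id.split(sep) if p.strip())
        if totals and totals[-1][0] == day:
            # Same day as the previous record: update that day's total.
            totals[-1] = (day, len(seen))
        else:
            totals.append((day, len(seen)))
    return totals


# Hypothetical sample data, not real STARGEO records.
sample = [
    (date(2017, 1, 3), "24981333"),
    (date(2017, 1, 1), "11111111|\n|22222222"),
    (date(2017, 1, 3), "11111111"),
    (date(2017, 1, 2), None),
]
curve = cumulative_distinct_pmids(sample)
```

Each PMID is counted once, on the first day it appears, so the resulting curve never decreases and its last value equals the total number of distinct PMIDs.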

Suor commented 7 years ago

Ok, Ilya is on it.

ir4y commented 7 years ago

@idrdex I have pushed a graph of distinct PMIDs distributed by date. http://localhost:8000/stats/

BTW, what is the priority for the other graphs? They are not updating now. I am planning to finish this task after SkinIQ. Is that OK?

iDrDex commented 7 years ago

Sure. PMIDs were critical as we are about to publish the paper. We should reorganize the tabs on the stats page as well, but let's revisit that after the melanoma app is delivered. Honestly, we need to focus on a full stargeo redesign. But melanoma is the highest priority now. Thx.